Stable all.requests #5

Merged
merged 97 commits into from Oct 7, 2024
Changes from 96 commits
Commits
97 commits
e769d00
pages and legacy lighthouse
max-ostapenko Aug 22, 2024
2933e2a
fix
max-ostapenko Aug 22, 2024
d7da0be
first test
max-ostapenko Aug 22, 2024
bb79780
pages reference
max-ostapenko Aug 22, 2024
d11853f
pages insert
max-ostapenko Aug 22, 2024
326a062
date var
max-ostapenko Aug 22, 2024
0ca431f
js rewrite
max-ostapenko Aug 23, 2024
e325e74
dataset
max-ostapenko Aug 23, 2024
7f967d7
Initial commit
max-ostapenko Aug 26, 2024
8416aa1
init
max-ostapenko Aug 26, 2024
f7a217d
migrated to external repo
max-ostapenko Aug 26, 2024
f2bd92d
core_web_vitals
max-ostapenko Aug 26, 2024
1e81298
clean graph, tested
max-ostapenko Aug 26, 2024
e7ab825
publish core_web_vitals.technologies
max-ostapenko Aug 26, 2024
ee4119c
Dev (#1)
max-ostapenko Aug 26, 2024
59e6bbf
technologies partitioning
max-ostapenko Aug 26, 2024
a54bc56
sync
max-ostapenko Aug 26, 2024
13ea93b
past month date for cwv
max-ostapenko Aug 26, 2024
01ae1c5
8pm
max-ostapenko Aug 26, 2024
1b67347
package-lock.json
max-ostapenko Aug 26, 2024
b48d1c1
ignore full-refresh
max-ostapenko Aug 26, 2024
d4e0ef7
readme
max-ostapenko Aug 26, 2024
90374ae
updated tags and example assert
max-ostapenko Aug 26, 2024
66cb2dc
dependency assertions
max-ostapenko Aug 26, 2024
ecbbc40
current month commented
max-ostapenko Aug 26, 2024
ce2d190
assert fix
max-ostapenko Aug 27, 2024
f903e39
all tables publish
max-ostapenko Aug 27, 2024
e91433e
incremental tables
max-ostapenko Aug 27, 2024
e990365
node script
max-ostapenko Aug 27, 2024
df61fed
enable legacy
max-ostapenko Aug 27, 2024
40a78ab
Merge branch 'dev' into dev
max-ostapenko Aug 27, 2024
12e18a1
missing package name
max-ostapenko Aug 28, 2024
091e634
table configs
max-ostapenko Aug 28, 2024
48adc3f
all.requests and all.parsed_css
max-ostapenko Aug 28, 2024
98865c2
dev sampling vars
max-ostapenko Aug 29, 2024
89b5e5c
sampling instead of rank
max-ostapenko Aug 29, 2024
f6b3ca2
readme upd
max-ostapenko Aug 29, 2024
9d29f43
dev hints
max-ostapenko Aug 29, 2024
855af63
dev sampling for tech report
max-ostapenko Aug 29, 2024
375eb79
tech report workflow
max-ostapenko Aug 29, 2024
358f7eb
removed sampling
max-ostapenko Aug 29, 2024
b69d4fd
dates flexibility
max-ostapenko Aug 29, 2024
3ce87b4
fix
max-ostapenko Aug 29, 2024
c7d7b1e
formatting
max-ostapenko Aug 29, 2024
a4619d7
other legacy tables
max-ostapenko Aug 29, 2024
5d41d58
docs and dependencies
max-ostapenko Aug 29, 2024
f1b9310
comment
max-ostapenko Aug 29, 2024
b219de5
Merge branch 'main' into main
max-ostapenko Aug 29, 2024
fd47503
Update definitions/output/pages.js
max-ostapenko Aug 29, 2024
f5d9d56
Update definitions/output/technologies.js
max-ostapenko Aug 29, 2024
587e7ad
Update package.json
max-ostapenko Aug 29, 2024
d97bb48
Update workflow_settings.yaml
max-ostapenko Aug 29, 2024
20d7b3f
format
max-ostapenko Aug 29, 2024
0742be8
merge
max-ostapenko Aug 29, 2024
83cc367
not dependent on all.pages
max-ostapenko Aug 29, 2024
85f0846
migrated to function trigger
max-ostapenko Aug 31, 2024
4eeb851
cloud function
max-ostapenko Sep 1, 2024
8c593c3
readme update
max-ostapenko Sep 1, 2024
ceec022
Merge branch 'dev' into dev
max-ostapenko Sep 1, 2024
30a3304
deployed function
max-ostapenko Sep 1, 2024
ef97bbe
readme updates
max-ostapenko Sep 1, 2024
816b22e
readme update
max-ostapenko Sep 1, 2024
58a9841
init stable copies
max-ostapenko Sep 2, 2024
dae6e7c
requests ready
max-ostapenko Sep 3, 2024
a0148ae
adjusted requests pipeline
max-ostapenko Sep 3, 2024
141f989
use release configs in prod
max-ostapenko Sep 3, 2024
80ed090
readme update
max-ostapenko Sep 3, 2024
d4d8ee2
tags update
max-ostapenko Sep 3, 2024
56d39d3
Merge branch 'dev' into dev1
max-ostapenko Sep 3, 2024
44c830c
dev sampling
max-ostapenko Sep 3, 2024
b25ddee
prune summary
max-ostapenko Sep 4, 2024
c461e26
sorted
max-ostapenko Sep 4, 2024
a841016
false when target exists
max-ostapenko Sep 5, 2024
3434376
Merge branch 'main' into dev1
max-ostapenko Sep 9, 2024
a50d682
Merge branch 'main' into dev1
max-ostapenko Sep 9, 2024
75e57a7
dev sampling
max-ostapenko Sep 9, 2024
f059a1b
newline
max-ostapenko Sep 9, 2024
5d3665c
Merge branch 'dev1' into dev1
max-ostapenko Sep 9, 2024
ae9c90c
Merge branch 'main' into dev1
max-ostapenko Sep 10, 2024
115c6c0
trigger cleanup
max-ostapenko Sep 10, 2024
8c1416e
formatting
max-ostapenko Sep 10, 2024
9375d1c
Merge branch 'main' into dev1
max-ostapenko Sep 10, 2024
4ddfb42
Merge branch 'main' into main
max-ostapenko Sep 10, 2024
cebe4a1
Merge branch 'dev1' into dev1
max-ostapenko Sep 10, 2024
5a6fccf
Merge branch 'main' into main
max-ostapenko Sep 10, 2024
74e7918
forEach iteration
max-ostapenko Sep 10, 2024
2b06152
create table with operate
max-ostapenko Sep 10, 2024
fefe57f
new test tables script
max-ostapenko Sep 11, 2024
696a8b6
tested
max-ostapenko Sep 11, 2024
70f1a9c
merge
max-ostapenko Sep 18, 2024
f89ddcf
Merge branch 'main' into main
max-ostapenko Sep 18, 2024
654bcc7
JSON columns
max-ostapenko Sep 29, 2024
ca2c0ab
Merge branch 'main' into main
max-ostapenko Sep 29, 2024
33f58ac
job per client
max-ostapenko Sep 29, 2024
da27681
native object pruning
max-ostapenko Sep 30, 2024
0f40f2e
Merge branch 'main' into main
max-ostapenko Sep 30, 2024
5857323
Update definitions/output/all/reprocess_requests.js
max-ostapenko Oct 7, 2024
README.md (6 changes: 5 additions & 1 deletion)
@@ -1,6 +1,10 @@
# HTTP Archive BigQuery pipeline with Dataform

## Tables
This repo handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.

## Pipelines

The pipelines run in the Dataform service on Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used for each triggered pipeline run.

### Crawl tables in `all` dataset

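For reference, the trigger described in the README amounts to compiling the repository's production release config and invoking the resulting workflow through the Dataform API (the same call pattern as in the src/dataform.js changes at the bottom of this diff). A minimal sketch using the @google-cloud/dataform Node.js client follows; the project, location, repository path and tag are placeholders, not values taken from this PR.

const { DataformClient } = require('@google-cloud/dataform');

const dataform = new DataformClient();
// Placeholder repository path; the real project/location/repository differ.
const repoURI = 'projects/my-project/locations/us-central1/repositories/crawl-data';

async function triggerPipeline () {
  // Compile the code pinned by the "production" release config.
  const [compilationResult] = await dataform.createCompilationResult({
    parent: repoURI,
    compilationResult: {
      releaseConfig: `${repoURI}/releaseConfigs/production`
    }
  });

  // Run the compiled workflow, optionally limited to a tag (illustrative value).
  const [workflowInvocation] = await dataform.createWorkflowInvocation({
    parent: repoURI,
    workflowInvocation: {
      compilationResult: compilationResult.name,
      invocationConfig: { includedTags: ['crawl_results_all'] }
    }
  });

  console.log(`Started workflow invocation: ${workflowInvocation.name}`);
}

triggerPipeline();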
definitions/extra/test_env.js (44 changes: 23 additions & 21 deletions)
@@ -1,26 +1,28 @@
const two_months_ago = constants.fn_past_month(constants.fn_past_month(constants.current_month));
const date = constants.fn_past_month(constants.current_month);

operate("test_env", {
hasOutput: true,
disabled: true // MUST NOT be commented in main branch
}).queries(ctx => `
CREATE OR REPLACE TABLE ${ctx.ref("all", "pages")} AS
SELECT *
FROM httparchive.all.pages ${constants.dev_TABLESAMPLE}
WHERE date = '${two_months_ago}';
var resources_list = [
//{datasetId: "all", tableId: "pages"},
{datasetId: "all", tableId: "requests"},
//{datasetId: "all", tableId: "parsed_css"},
//{datasetId: "core_web_vitals", tableId: "technologies"},
];

CREATE OR REPLACE TABLE ${ctx.ref("all", "requests")} AS
SELECT *
FROM httparchive.all.requests ${constants.dev_TABLESAMPLE}
WHERE date = '${two_months_ago}';
resources_list.forEach(resource => {
operate(`test_table ${resource.datasetId}_${resource.tableId}`, {
disabled: !constants.is_dev_env // enabled when workflow variable env_name = "dev"
}).tags([
"test_tables"
]).queries(ctx => `
CREATE SCHEMA IF NOT EXISTS ${resource.datasetId}_dev;

CREATE OR REPLACE TABLE ${ctx.ref("all", "parsed_css")} AS
SELECT *
FROM httparchive.all.parsed_css ${constants.dev_TABLESAMPLE}
WHERE date = '${two_months_ago}';
DROP TABLE ${resource.datasetId}_dev.dev_${resource.tableId};

CREATE TABLE ${resource.datasetId}_dev.dev_${resource.tableId}
LIKE httparchive.${resource.datasetId}.${resource.tableId};

CREATE OR REPLACE TABLE ${ctx.ref("core_web_vitals", "technologies")} AS
INSERT INTO ${resource.datasetId}_dev.dev_${resource.tableId}
SELECT *
FROM httparchive.core_web_vitals.technologies
WHERE date = '${two_months_ago}'
`)
FROM httparchive.${resource.datasetId}.${resource.tableId} ${constants.dev_TABLESAMPLE}
WHERE date = '${date}'
`);
})
definitions/output/all/reprocess_requests.js (122 changes: 122 additions & 0 deletions)
@@ -0,0 +1,122 @@
operate(`all_requests_stable_pre`).tags(
["all_requests_stable"]
).queries(`
CREATE SCHEMA IF NOT EXISTS all_dev;

DROP TABLE IF EXISTS \`all_dev.requests_stable\`;

CREATE TABLE \`all_dev.requests_stable\`
(
date DATE NOT NULL OPTIONS(description="YYYY-MM-DD format of the HTTP Archive monthly crawl"),
client STRING NOT NULL OPTIONS(description="Test environment: desktop or mobile"),
page STRING NOT NULL OPTIONS(description="The URL of the page being tested"),
is_root_page BOOL OPTIONS(description="Whether the page is the root of the origin."),
root_page STRING NOT NULL OPTIONS(description="The URL of the root page being tested"),
rank INT64 OPTIONS(description="Site popularity rank, from CrUX"),
url STRING NOT NULL OPTIONS(description="The URL of the request"),
is_main_document BOOL NOT NULL OPTIONS(description="Whether this request corresponds with the main HTML document of the page, which is the first HTML request after redirects"),
type STRING OPTIONS(description="Simplified description of the type of resource (script, html, css, text, other, etc)"),
index INT64 OPTIONS(description="The sequential 0-based index of the request"),
payload JSON OPTIONS(description="JSON-encoded WebPageTest result data for this request"),
summary JSON OPTIONS(description="JSON-encoded summarization of request data"),
request_headers ARRAY<STRUCT<
name STRING OPTIONS(description="Request header name"),
value STRING OPTIONS(description="Request header value")
>> OPTIONS(description="Request headers"),
response_headers ARRAY<STRUCT<
name STRING OPTIONS(description="Response header name"),
value STRING OPTIONS(description="Response header value")
>> OPTIONS(description="Response headers"),
response_body STRING OPTIONS(description="Text-based response body")
)
PARTITION BY date
CLUSTER BY client, is_root_page, type, rank
OPTIONS(
require_partition_filter=true
);
`);

const iterations = [];
const clients = constants.clients;

for (
let month = constants.current_month;
month >= '2024-09-01'; // 2022-07-01
month = constants.fn_past_month(month)) {
clients.forEach((client) => {
iterations.push({
month: month,
client: client
})
})
}

iterations.forEach((iteration, i) => {
operate(`all_requests_stable ${iteration.month} ${iteration.client}`).tags(
["all_requests_stable"]
).dependencies([
i===0 ? "all_requests_stable_pre" : `all_requests_stable ${iterations[i-1].month} ${iterations[i-1].client}`
]).queries(ctx => `
INSERT INTO \`all_dev.requests_stable\`
SELECT
requests.date,
requests.client,
requests.page,
requests.is_root_page,
requests.root_page,
crux.rank,
requests.url,
requests.is_main_document,
requests.type,
requests.index,
JSON_REMOVE(
SAFE.PARSE_JSON(payload, wide_number_mode => 'round'),
'$._headers'
) AS payload,
JSON_REMOVE(
SAFE.PARSE_JSON(requests.summary, wide_number_mode => 'round'),
'$.crawlid',
'$.firstHtml',
'$.firstReq',
'$.pageid',
'$.req_accept_encoding',
'$.req_accept_language',
'$.req_accept',
'$.req_if_modified_since',
'$.req_if_none_match',
'$.req_referer',
'$.req_user_agent',
'$.reqOtherHeaders',
'$.requestid',
'$.resp_age',
'$.resp_cache_control',
'$.resp_content_length',
'$.resp_content_type',
'$.resp_date',
'$.resp_etag',
'$.resp_last_modified',
'$.resp_server',
'$.resp_vary',
'$.respOtherHeaders',
'$.startedDateTime',
'$.url',
'$.urlShort'
) as summary,
requests.request_headers,
requests.response_headers,
requests.response_body
FROM (
SELECT *
FROM \`all.requests\` ${constants.dev_TABLESAMPLE}
WHERE date = '${iteration.month}'
AND client = '${iteration.client}') AS requests
LEFT JOIN (
SELECT DISTINCT
CONCAT(origin, '/') AS page,
experimental.popularity.rank AS rank
FROM ${ctx.resolve("chrome-ux-report", "experimental", "global")}
WHERE yyyymm = ${constants.fn_past_month(iteration.month).substring(0, 7).replace('-', '')}
) AS crux
ON requests.root_page = crux.page;
`)
});
definitions/output/core_web_vitals/technologies.js (28 changes: 14 additions & 14 deletions)
@@ -24,7 +24,7 @@ CREATE TEMP FUNCTION IS_NON_ZERO(good FLOAT64, needs_improvement FLOAT64, poor F
);

CREATE TEMP FUNCTION GET_LIGHTHOUSE_CATEGORY_SCORES(categories STRING)
RETURNS STRUCT<accessibility NUMERIC, best_practices NUMERIC, performance NUMERIC, pwa NUMERIC, seo NUMERIC>
RETURNS STRUCT<accessibility NUMERIC, best_practices NUMERIC, performance NUMERIC, pwa NUMERIC, seo NUMERIC>
LANGUAGE js AS '''
try {
const $ = JSON.parse(categories);
@@ -74,23 +74,23 @@ crux AS (
END AS rank,
CONCAT(origin, '/') AS root_page_url,
IF(device = 'desktop', 'desktop', 'mobile') AS client,

# CWV
IS_NON_ZERO(fast_fid, avg_fid, slow_fid) AS any_fid,
IS_GOOD(fast_fid, avg_fid, slow_fid) AS good_fid,
IS_NON_ZERO(small_cls, medium_cls, large_cls) AS any_cls,
IS_GOOD(small_cls, medium_cls, large_cls) AS good_cls,
IS_NON_ZERO(fast_lcp, avg_lcp, slow_lcp) AS any_lcp,
IS_GOOD(fast_lcp, avg_lcp, slow_lcp) AS good_lcp,

(IS_GOOD(fast_inp, avg_inp, slow_inp) OR fast_inp IS NULL) AND
IS_GOOD(small_cls, medium_cls, large_cls) AND
IS_GOOD(fast_lcp, avg_lcp, slow_lcp) AS good_cwv_2024,

(IS_GOOD(fast_fid, avg_fid, slow_fid) OR fast_fid IS NULL) AND
IS_GOOD(small_cls, medium_cls, large_cls) AND
IS_GOOD(fast_lcp, avg_lcp, slow_lcp) AS good_cwv_2023,

# WV
IS_NON_ZERO(fast_fcp, avg_fcp, slow_fcp) AS any_fcp,
IS_GOOD(fast_fcp, avg_fcp, slow_fcp) AS good_fcp,
@@ -114,7 +114,7 @@ technologies AS (
${ctx.resolve("all", "pages")},
UNNEST(technologies) AS technology
WHERE
date = '${past_month}' AND
date = '${past_month}' ${constants.dev_rank5000_filter} AND
technology.technology IS NOT NULL AND
technology.technology != ''
UNION ALL
@@ -125,7 +125,7 @@ UNION ALL
FROM
${ctx.resolve("all", "pages")}
WHERE
date = '${past_month}'
date = '${past_month}' ${constants.dev_rank5000_filter}
),

categories AS (
@@ -137,7 +137,7 @@ categories AS (
UNNEST(technologies) AS technology,
UNNEST(technology.categories) AS category
WHERE
date = '${past_month}'
date = '${past_month}' ${constants.dev_rank5000_filter}
GROUP BY
app
UNION ALL
@@ -149,7 +149,7 @@ UNION ALL
UNNEST(technologies) AS technology,
UNNEST(technology.categories) AS category
WHERE
date = '${past_month}' AND
date = '${past_month}' ${constants.dev_rank5000_filter} AND
client = 'mobile'
),

@@ -165,7 +165,7 @@ summary_stats AS (
FROM
${ctx.resolve("all", "pages")}
WHERE
date = '${past_month}'
date = '${past_month}' ${constants.dev_rank5000_filter}
),

lab_data AS (
@@ -206,7 +206,7 @@ SELECT
app,
client,
COUNT(0) AS origins,

# CrUX data
COUNTIF(good_fid) AS origins_with_good_fid,
COUNTIF(good_cls) AS origins_with_good_cls,
@@ -227,19 +227,19 @@ SELECT
SAFE_DIVIDE(COUNTIF(good_cwv_2024), COUNTIF(any_lcp AND any_cls)) AS pct_eligible_origins_with_good_cwv,
SAFE_DIVIDE(COUNTIF(good_cwv_2024), COUNTIF(any_lcp AND any_cls)) AS pct_eligible_origins_with_good_cwv_2024,
SAFE_DIVIDE(COUNTIF(good_cwv_2023), COUNTIF(any_lcp AND any_cls)) AS pct_eligible_origins_with_good_cwv_2023,

# Lighthouse data
APPROX_QUANTILES(accessibility, 1000)[OFFSET(500)] AS median_lighthouse_score_accessibility,
APPROX_QUANTILES(best_practices, 1000)[OFFSET(500)] AS median_lighthouse_score_best_practices,
APPROX_QUANTILES(performance, 1000)[OFFSET(500)] AS median_lighthouse_score_performance,
APPROX_QUANTILES(pwa, 1000)[OFFSET(500)] AS median_lighthouse_score_pwa,
APPROX_QUANTILES(seo, 1000)[OFFSET(500)] AS median_lighthouse_score_seo,

# Page weight stats
APPROX_QUANTILES(bytesTotal, 1000)[OFFSET(500)] AS median_bytes_total,
APPROX_QUANTILES(bytesJS, 1000)[OFFSET(500)] AS median_bytes_js,
APPROX_QUANTILES(bytesImg, 1000)[OFFSET(500)] AS median_bytes_image

FROM
lab_data
JOIN
definitions/sources/declares.js (8 changes: 7 additions & 1 deletion)
@@ -16,11 +16,17 @@ for (const table of crux_tables) {
});

assert(`${table}_not_empty`).query(ctx => `
SELECT
SELECT
'No data for the specified date' AS error_message
FROM ${ctx.ref("chrome-ux-report", "materialized", table)}
WHERE yyyymm = ${past_month}
GROUP BY yyyymm
HAVING COUNT(1) = 0
`);
}

declare({
database: "chrome-ux-report",
schema: "experimental",
name: "global",
});
src/dataform.js (12 changes: 0 additions & 12 deletions)
@@ -14,18 +14,6 @@ async function get_compilation_results(repoURI) {
compilationResult: {
releaseConfig: `${repoURI}/releaseConfigs/production`
}
}, dev_request = {
parent: repoURI,
compilationResult: {
gitCommitish: 'dev'
},
codeCompilationConfig: {
schemaSuffix: 'dev',
tablePrefix: 'dev',
vars: {
current_month: '2024-08-01',
},
}
};

console.log(`Creating Dataform compilation result: ${JSON.stringify(request, null, 2)}`);