Stable all.requests #5

Merged: 97 commits merged into main from dev1 on Oct 7, 2024
Conversation

@max-ostapenko (Contributor) commented Sep 2, 2024

Schema changes

Intermediate table

`all_dev.requests_stable`

Run reprocessing jobs with the `all_requests_stable` tag.

After reprocessing

DROP TABLE `all.requests`;

CREATE TABLE `all.requests`
COPY `all_dev.requests_stable`;
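
Before the DROP, it's worth validating the reprocessed copy against the live table (the "run checks on the new table" step mentioned further down). A minimal sketch of such a check, assuming both tables keep the date partition column; the actual checks used for this migration aren't spelled out in this PR:

-- Hypothetical sanity check before the swap: compare per-crawl row counts.
-- Any row returned flags a crawl whose counts changed during reprocessing.
SELECT
  COALESCE(orig.date, new.date) AS crawl_date,
  orig.row_count AS original_rows,
  new.row_count AS reprocessed_rows
FROM (
  SELECT date, COUNT(*) AS row_count
  FROM `all.requests`
  GROUP BY date
) AS orig
FULL OUTER JOIN (
  SELECT date, COUNT(*) AS row_count
  FROM `all_dev.requests_stable`
  GROUP BY date
) AS new
ON orig.date = new.date
WHERE orig.row_count IS DISTINCT FROM new.row_count;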

@max-ostapenko (Contributor, Author)

Test run: original and processed table

@max-ostapenko (Contributor, Author)

Would really appreciate review here, as I'm almost ready with backfill PR #10.

@tunetheweb (Member)

> Would really appreciate review here, as I'm almost ready with backfill PR #10.

Hey @max-ostapenko, sorry for the radio silence here. I mistakenly presumed this was discussed on the last HTTP Archive maintainers call (which I missed), but I've just found out that @rviscomi missed it too, so I think we each presumed the other was dealing with it.

I think it would be good to discuss this on a call (whether the next planned one on 15th October or before). Questions/concerns I have include:

  • Is this definitely all the changes we plan?
  • What happens while this is running? Is the table not accessible? Or will it continue to be accessible as each month's data is back-populated?
  • How long will it take to run?
  • Any impact on the next crawl (assuming it's not complete by then)?
  • Any impact on the Web Almanac, and are we better off waiting until the majority of analysis is completed there?

@max-ostapenko (Contributor, Author)

> I think it would be good to discuss this on a call (whether the next planned one on 15th October or before). Questions/concerns I have include:

I would definitely like to discuss it before the next crawl, and to notify users of the upcoming changes beforehand.

> • Is this definitely all the changes we plan?

Please drop here any issues or comments that I've missed.

> • What happens while this is running? Is the table not accessible? Or will it continue to be accessible as each month's data is back-populated?

The reprocessing will write into `all_dev.requests_stable`.

Once the reprocessing is done we can:

  • run checks on the new table,
  • test all the production queries changes,
  • notify users with the exact date,
  • update Almanac queries from this year and last,

and eventually replace the live table with the CREATE TABLE ... COPY DDL. So no downtime.

> • How long will it take to run?

We can only estimate by running the first month.

In September 2024 it took 6h to process the requests tables without any transformations.
Let's assume this one takes twice as long: 12h per monthly crawl.
At 12h each, the ~28 monthly crawls to reprocess come to ~14 days in total.

> • Any impact on the next crawl (assuming it's not complete by then)?

Ideally we start this before the new crawl starts (so that the stable table is already synced with the new schema and cleaned objects, and we can just run INSERT INTO ... SELECT *; see the sketch below).

If anything isn't yet synced in the WPT agent before the crawl, we can always clean up the last crawl's data within the pipeline.
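
A minimal sketch of that monthly append, assuming the schemas are already aligned; the staging table name below is a placeholder, since the pipeline's actual source table isn't named in this thread:

-- Hypothetical monthly load once the schema is synced: append the new crawl
-- as-is, with no per-column transformations.
INSERT INTO `all_dev.requests_stable`
SELECT *
FROM `crawl_staging.requests`  -- placeholder for the pipeline's output table
WHERE date = DATE '2024-10-01';  -- the crawl being loaded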

> • Any impact on the Web Almanac, and are we better off waiting until the majority of analysis is completed there?

Queries reading from struct columns need to be adjusted to the new schema to be valid.
All the analysis PRs are still pending; I'll need to go through them and push reviews where the articles have already been written.
Then we merge the adjustments to the published queries at the moment we replace the live table. That will give readers a smooth experience.
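
As an illustration of the kind of adjustment meant here (the exact before/after schema isn't spelled out in this thread), a query reading a nested field would move from path extraction over a STRING payload to direct field access, assuming payload becomes a native JSON column:

-- Before: payload stored as a STRING, extracted via a JSON path.
SELECT JSON_VALUE(payload, '$._font_details.counts.num_glyphs') AS num_glyphs
FROM `all.requests`
WHERE date = DATE '2024-06-01';

-- After (assumed native JSON column): direct field access.
SELECT JSON_VALUE(payload._font_details.counts.num_glyphs) AS num_glyphs
FROM `all.requests`
WHERE date = DATE '2024-06-01';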

@tunetheweb (Member)

We could pull the _font_details stuff into its own column? However, I think it's niche enough (and the payload will be small enough with the above changes) that it's OK to leave it in the payload column. WDYT @rviscomi?

    "_font_details": {
        "table_sizes": {
            "GDEF": 100,
            "GPOS": 3264,
            "GSUB": 458,
            "OS/2": 96,
...
        "counts": {
            "num_cmap_codepoints": 215,
            "num_glyphs": 238
        }
    },

@tunetheweb (Member)

We need to look at native JSON columns, as there were some payloads that couldn't be processed that way and so had to use JavaScript-parsed JSON columns instead. See: HTTPArchive/httparchive.org#923 (comment). Is this going to be a problem if we move to native JSON columns?

@tunetheweb (Member)

Ignore this. SAFE.PARSE_JSON(payload, wide_number_mode => 'round') is the answer, as you've already implemented.
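
For context, a self-contained illustration of what wide_number_mode => 'round' buys: the default 'exact' mode rejects numbers that can't round-trip through FLOAT64, and the SAFE. prefix turns that rejection into NULL, while 'round' keeps the value by rounding:

-- 9223372036854775807 (2^63 - 1) is not exactly representable in FLOAT64,
-- so the default exact-mode parse fails and SAFE.PARSE_JSON returns NULL;
-- wide_number_mode => 'round' parses it with rounding instead.
SELECT
  SAFE.PARSE_JSON('{"n": 9223372036854775807}') AS exact_mode_is_null,
  SAFE.PARSE_JSON('{"n": 9223372036854775807}', wide_number_mode => 'round') AS rounded;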

@tunetheweb (Member) left a comment


LGTM, with one small request to leave the crawlid and pageid in summary.

Review thread on definitions/output/all/reprocess_requests.js (outdated, resolved)
@rviscomi (Member) commented Oct 7, 2024

> We could pull the _font_details stuff into its own column? However, I think it's niche enough (and the payload will be small enough with the above changes) that it's OK to leave it in the payload column. WDYT @rviscomi?

Agreed. It could work, but I'm happy to leave it as is.

Co-authored-by: Barry Pollard <barrypollard@google.com>
@max-ostapenko merged commit 94718f9 into main on Oct 7, 2024. 3 checks passed.
@max-ostapenko deleted the dev1 branch on October 7, 2024 at 20:18.