
Update the "latest" tables from Dataflow #81

Closed
rviscomi opened this issue Jun 3, 2020 · 5 comments

rviscomi commented Jun 3, 2020

Forked from #76

Currently we use scheduled queries to scan each dataset/client combo for the latest release and save that to its respective latest.<dataset>_<client> table.

For example, here's the scheduled query that generates the latest.response_bodies_mobile table:

#standardSQL
SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  -- Restrict to the mobile client's tables.
  ENDS_WITH(_TABLE_SUFFIX, 'mobile') AND
  -- Keep only tables whose YYYY_MM_DD prefix matches the most recent release.
  SUBSTR(_TABLE_SUFFIX, 0, 10) = (
  SELECT
    SUBSTR(table_id, 0, 10) AS date
  FROM
    `httparchive.response_bodies.__TABLES_SUMMARY__`
  ORDER BY
    table_id DESC
  LIMIT
    1)

BigQuery usually has heuristics to minimize the number of bytes processed when the WHERE clause clearly limits the _TABLE_SUFFIX pseudocolumn to a particular table. But I'm not sure that's happening here, because the estimated cost of this query is over $1,000 (~200 TB): "This query will process 202.9 TB when run."
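
For comparison, a minimal sketch of the kind of filter that does get pruned (the 2020_05_01_mobile suffix is just an illustrative value, not a claim about which release is current): with a constant _TABLE_SUFFIX the dry-run estimate drops to the size of that single table, whereas the subquery form above is estimated at the full dataset.

-- Constant suffix: BigQuery prunes the wildcard scan to this one table,
-- so the estimate reflects only that table's size.
SELECT *
FROM `httparchive.response_bodies.*`
WHERE _TABLE_SUFFIX = '2020_05_01_mobile'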

Queries for each dataset/client combo are scheduled to run on the first couple of days of every month. They become more expensive over time as we add new tables to every dataset.

A much more efficient approach would be to overwrite the latest.* tables in the Dataflow pipeline when we create the tables for each release. Rather than updating the deprecated Java pipeline, add this as a feature to #79.
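
For illustration only (this is not the pipeline code, and the 2020_06_01_mobile table name is hypothetical), the Dataflow step would amount to the equivalent of the following overwrite, which only reads the single release it just created:

-- Equivalent of a WRITE_TRUNCATE load keyed to the release just produced;
-- only the named release table is scanned.
CREATE OR REPLACE TABLE `httparchive.latest.response_bodies_mobile` AS
SELECT *
FROM `httparchive.response_bodies.2020_06_01_mobile`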

rviscomi self-assigned this Jun 3, 2020
max-ostapenko commented

@rviscomi I see the latest dataset hasn't been updated since 2022.
Maybe we could offer a guide on how to get the latest and sampled data from all?
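
For instance, a sketch of what such a guide could show (assuming the all.pages table is partitioned by date, so a literal date filter prunes the scan; the date below is illustrative):

-- Find the most recent crawl date first (only the date column is read)...
SELECT MAX(date) FROM `httparchive.all.pages`;

-- ...then query that date explicitly so partition pruning applies.
SELECT page, client, custom_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-08-01'  -- substitute the date returned above
LIMIT 1000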

tunetheweb commented

The httparchive.latest.pages and httparchive.latest.requests tables are views onto the all dataset, so they do point to the latest release (and, weirdly, the same is true of httparchive.latest.lighthouse, which is a subset of httparchive.latest.pages).

The other tables are switched off in Scheduled Queries. I can't remember why we did that, but given we want to encourage people off the old data model anyway, and I'm not aware of any complaints, I'm not inclined to re-enable them for the old tables. Maybe we should remove them to make that more obvious.

It's similar for the sample_data datasets. And I definitely think we should promote those more!
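
For example (assuming the sample tables keep the same columns as all.pages), exploratory queries against them stay cheap no matter how they're written:

-- Scans at most the 1k-row sample table, so the cost ceiling is tiny.
SELECT page, client, custom_metrics
FROM `httparchive.sample_data.pages_1k`
LIMIT 100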

max-ostapenko commented

  1. Agree, removing the outdated tables will definitely shift the focus to the new ones.

These are at least up to date:

  • sample_data.pages_1k
  • sample_data.requests_1k
  • latest.lighthouse
  • latest.pages
  • latest.requests

  2. I'm not sure the views really do the job of introducing people to the data:

  • SELECT page, client, custom_metrics FROM `httparchive.latest.pages` LIMIT 1000 says 333 TB will be processed.
  • Adding a date filter brings it down to 63 TB.
  • TABLESAMPLE doesn't work with views (see the sketch below).

So unless I'm already familiar with the data, I would never run the query.
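
As a sketch of that limitation (the date is illustrative): TABLESAMPLE works on the all.pages base table but is rejected on the latest.pages view.

-- Works on a base table: sampling scans only a fraction of the blocks.
SELECT page, client
FROM `httparchive.all.pages` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE date = '2024-08-01'

-- The same TABLESAMPLE clause on `httparchive.latest.pages` fails, because
-- table sampling can't be applied to views.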

There are advantages in having these as tables:

  • transparent data volumes,
  • even more cost optimization,
  • no query limitations.

Cost safety will help people learn the data.


tunetheweb commented Aug 22, 2024

Yeah estimates are all off on views.

You can prove it here:

-- Says it's gonna cost 21.01 GB even though it only takes 2.14 GB
SELECT page, client, custom_metrics FROM `httparchive.latest.pages` WHERE rank = 1000

-- Says it's gonna cost 2.14 GB, which it does
SELECT page, client, custom_metrics FROM `httparchive.all.pages` WHERE date = '2024-08-01' AND rank = 1000

And yes, that means they aren't great, because you're asking users to take a leap of faith! Wish BigQuery would fix these estimation bugs.

FYI we created the views here after a request: HTTPArchive/data-pipeline#141

> There are advantages in having these as tables:
>
>   • transparent data volumes,
>   • even more cost optimization,
>   • no query limitations.
>
> Cost safety will help people learn the data.

I definitely agree with that for the sample data tables. Less so for the latest tables (they are still VERY big tables!)

max-ostapenko commented

I removed the empty tables (with legacy schemas) from the latest dataset, so we have only views there now.

Let's close this old issue, and continue in HTTPArchive/data-pipeline#141 then.

max-ostapenko closed this as not planned on Sep 9, 2024