
Update the "latest" tables from Dataflow #81

Closed
rviscomi opened this issue Jun 3, 2020 · 5 comments

rviscomi commented Jun 3, 2020

Forked from #76

Currently we use scheduled queries to scan each dataset/client combo for the latest release and save that to its respective latest.<dataset>_<client> table.

For example, here's the scheduled query that generates the latest.response_bodies_mobile table:

#standardSQL
SELECT
  *
FROM
  `httparchive.response_bodies.*`
WHERE
  -- Restrict to the mobile client's tables.
  ENDS_WITH(_TABLE_SUFFIX, 'mobile') AND
  -- Keep only tables whose YYYY_MM_DD prefix matches the most recent release.
  SUBSTR(_TABLE_SUFFIX, 0, 10) = (
  SELECT
    SUBSTR(table_id, 0, 10) AS date
  FROM
    `httparchive.response_bodies.__TABLES_SUMMARY__`
  ORDER BY
    table_id DESC
  LIMIT
    1)

BigQuery usually has heuristics to minimize the number of bytes processed when the WHERE clause clearly limits the _TABLE_SUFFIX pseudocolumn to a particular table. But I'm not sure that's happening here, because the estimated cost of this query is over $1,000 (~200 TB): "This query will process 202.9 TB when run."
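
For comparison, a minimal sketch of the kind of filter that does get pruned (the 2020_05_01_mobile suffix is just an illustrative value, not a claim about which release is current): with a constant _TABLE_SUFFIX the dry-run estimate drops to the size of that single table, whereas the subquery form above is estimated at the full dataset.

-- Constant suffix: BigQuery prunes the wildcard scan to this one table,
-- so the estimate reflects only that table's size.
SELECT *
FROM `httparchive.response_bodies.*`
WHERE _TABLE_SUFFIX = '2020_05_01_mobile'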

Queries for each dataset/client combo are scheduled to run on the first couple of days of every month. They become more expensive over time as we add new tables to every dataset.

A much more efficient approach would be to overwrite the latest.* tables in the Dataflow pipeline when we create the tables for each release. Rather than updating the deprecated Java pipeline, add this as a feature to #79.
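
For illustration only (this is not the pipeline code, and the 2020_06_01_mobile table name is hypothetical), the Dataflow step would amount to the equivalent of the following overwrite, which only reads the single release it just created:

-- Equivalent of a WRITE_TRUNCATE load keyed to the release just produced;
-- only the named release table is scanned.
CREATE OR REPLACE TABLE `httparchive.latest.response_bodies_mobile` AS
SELECT *
FROM `httparchive.response_bodies.2020_06_01_mobile`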

rviscomi self-assigned this Jun 3, 2020
max-ostapenko commented

@rviscomi I see the latest dataset hasn't been updated since 2022.
Maybe we could offer a guide on how to get the latest and sampled data from all?
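
For instance, a sketch of what such a guide could show (assuming the all.pages table is partitioned by date, so a literal date filter prunes the scan; the date below is illustrative):

-- Find the most recent crawl date first (only the date column is read)...
SELECT MAX(date) FROM `httparchive.all.pages`;

-- ...then query that date explicitly so partition pruning applies.
SELECT page, client, custom_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-08-01'  -- substitute the date returned above
LIMIT 1000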

tunetheweb commented

The httparchive.latest.pages and httparchive.latest.requests tables are views onto the all dataset, so they do point to the latest release (and, weirdly, the same is true of httparchive.latest.lighthouse, which is a subset of httparchive.latest.pages).

The other tables are switched off in Scheduled Queries. I can't remember why we did that, but given we want to encourage people off the old data model anyway, and I'm not aware of any complaints, I'm not inclined to re-enable them for the old tables. Maybe we should remove them to make that more obvious.

It's similar for the sample_data datasets. And I definitely think we should promote those more!
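
For example (assuming the sample tables keep the same columns as all.pages), exploratory queries against them stay cheap no matter how they're written:

-- Scans at most the 1k-row sample table, so the cost ceiling is tiny.
SELECT page, client, custom_metrics
FROM `httparchive.sample_data.pages_1k`
LIMIT 100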

max-ostapenko commented

  1. Agree, removing the outdated tables will definitely shift the focus to the new ones.

These are at least up to date:

  • sample_data.pages_1k
  • sample_data.requests_1k
  • latest.lighthouse
  • latest.pages
  • latest.requests

  2. I'm not sure the views really do the job of introducing people to the data:

  • SELECT page, client, custom_metrics FROM `httparchive.latest.pages` LIMIT 1000 says 333 TB will be processed.
  • Adding a date filter brings it down to 63 TB.
  • TABLESAMPLE doesn't work with views (see the sketch below).

So unless I'm already familiar with the data, I would never run the query.
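
As a sketch of that limitation (the date is illustrative): TABLESAMPLE works on the all.pages base table but is rejected on the latest.pages view.

-- Works on a base table: sampling scans only a fraction of the blocks.
SELECT page, client
FROM `httparchive.all.pages` TABLESAMPLE SYSTEM (1 PERCENT)
WHERE date = '2024-08-01'

-- The same TABLESAMPLE clause on `httparchive.latest.pages` fails, because
-- table sampling can't be applied to views.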

There are advantages in having these as tables:

  • transparent data volumes,
  • even more cost optimization,
  • no query limitations.

Cost safety will help people learn the data.


tunetheweb commented Aug 22, 2024

Yeah estimates are all off on views.

You can prove it here:

-- Says it's gonna cost 21.01 GB even though it only takes 2.14 GB
SELECT page, client, custom_metrics FROM `httparchive.latest.pages` WHERE rank = 1000

-- Says it's gonna cost 2.14 GB, which it does
SELECT page, client, custom_metrics FROM `httparchive.all.pages` WHERE date = '2024-08-01' AND rank = 1000

And yes, that means they aren't great, because you're asking users to take a leap of faith! Wish BigQuery would fix these estimation bugs.

FYI we created the views here after a request: HTTPArchive/data-pipeline#141

> There are advantages in having these as tables:
>
>   • transparent data volumes,
>   • even more cost optimization,
>   • no query limitations.
>
> Cost safety will help people learn the data.

I definitely agree with that for the sample data tables. Less so for the latest tables (they are still VERY big tables!)

max-ostapenko commented

I removed the empty tables (with legacy schemas) from the latest dataset, so we have only views there now.

Let's close this old issue, and continue in HTTPArchive/data-pipeline#141 then.

max-ostapenko closed this as not planned on Sep 9, 2024