Update the "latest" tables from Dataflow #81
Comments
@rviscomi I see dataset |
The other tables are switched off in Scheduled Queries. I can't remember why we did that, but given we want to encourage people off the old data model anyway, and I'm not aware of any complaints, I'm not inclined to re-enable them for the old tables. Maybe we should remove them to make that more obvious. It's similar for the |
These are at least up-to-date:
So unless I'm already familiar with the data, I would never run the query. There are advantages to having these as tables:
Cost safety will help promote learning the data.
Yeah, estimates are all off on views. You can prove it here:

```sql
-- Says it's gonna cost 21.01 GB even though it only takes 2.14 GB
SELECT page, client, custom_metrics
FROM `httparchive.latest.pages`
WHERE rank = 1000
```

```sql
-- Says it's gonna cost 2.14 GB, which it does
SELECT page, client, custom_metrics
FROM `httparchive.all.pages`
WHERE date = '2024-08-01'
  AND rank = 1000
```

And yes, that means they aren't great, because you're asking users to take a leap! Wish BigQuery would fix these bugs.

FYI, we created the views here after a request: HTTPArchive/data-pipeline#141
I definitely agree with that for the sample data tables. Less so for the |
I removed the empty tables (with legacy schemas) from the dataset.

Let's close this old issue, and continue in HTTPArchive/data-pipeline#141 then.
Forked from #76
Currently we use scheduled queries to scan each dataset/client combo for the latest release and save that to its respective `latest.<dataset>_<client>` table.

For example, here's the scheduled query that generates the `latest.response_bodies_mobile` table:

BigQuery usually has some heuristics to help minimize the number of bytes processed by a query if the WHERE clause clearly limits the `_TABLE_SUFFIX` pseudocolumn to a particular table. But I'm not sure that's happening here, because the estimated cost of this query is over $1,000 (200 TB):
This query will process 202.9 TB when run.
Queries for each dataset/client combo are scheduled to run in the first couple of days of every month. They become more expensive over time as we add new tables to every dataset.
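The scheduled query itself isn't quoted above, but the pattern it describes, selecting the newest shard via `_TABLE_SUFFIX`, would look roughly like this hypothetical sketch; the wildcard prefix and table names are assumptions, not the actual query:

```sql
-- Hypothetical sketch, not the real scheduled query.
-- Because the shard is chosen by a subquery rather than a constant,
-- BigQuery can't prune shards at estimation time, so the dry-run
-- prices the query as a scan of every matching table.
CREATE OR REPLACE TABLE `httparchive.latest.response_bodies_mobile` AS
SELECT *
FROM `httparchive.response_bodies.20*`
WHERE _TABLE_SUFFIX = (
  SELECT MAX(_TABLE_SUFFIX)
  FROM `httparchive.response_bodies.20*`
  WHERE ENDS_WITH(_TABLE_SUFFIX, 'mobile')
)
```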
A much more efficient approach would be to overwrite the `latest.*` tables in the Dataflow pipeline when we create the tables for each release. Rather than updating the deprecated Java pipeline, add this as a feature to #79.
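Whatever form it takes inside the pipeline, the per-release overwrite amounts to something like the following, run once with the just-published date substituted in; the `httparchive.all.pages` source and the date here are illustrative assumptions:

```sql
-- Hypothetical sketch of the per-release overwrite.
-- Filtering on a constant partition date lets BigQuery prune to a
-- single release, so the cost no longer grows as new tables are added.
CREATE OR REPLACE TABLE `httparchive.latest.pages` AS
SELECT *
FROM `httparchive.all.pages`
WHERE date = '2024-08-01'  -- the release just written by the pipeline
```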