Convert the latest dataset to views #141
Comments
As investigated in #142 this should be possible, with the exception that they cannot be used in wildcard queries. My main concerns would be:
One possibility is to create a
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
SELECT
date,
client
....
FROM
`httparchive.all.pages`
JOIN
`httparchive.all.latest_date`
USING (date)
WHERE
client = 'desktop'
);
And then could just update
I ran this, and it seemed to work:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_date` AS (
SELECT CAST('2022-08-01' AS DATE) AS date
);
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
JOIN
`httparchive.scratchspace.test_latest_date`
USING (date)
WHERE
client = 'desktop'
);
and I get these costs:
SELECT page FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.13 GB
SELECT page, rank FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.27 GB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop`; -- 4.79 TB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop` WHERE rank = 1000; -- 532.58 MB |
Could you clarify if the wildcard issue applies to the
Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern. |
No, the wildcard issue only applies if
And
SELECT
_TABLE_SUFFIX AS client,
col1
FROM
`httparchive.latest.summary_pages_*`
That will not work, as it's using a wildcard on two views. But if you did this to query one table, it would work fine:
SELECT
col1
FROM
`httparchive.latest.summary_pages_desktop`
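If a per-client column is still wanted without a wildcard, one possible workaround is to query each view separately and combine the results. This is only a sketch: `col1` is the placeholder column from the example above, and the `summary_pages_mobile` view name is an assumption based on the desktop/mobile split.

```sql
-- Sketch only: emulates _TABLE_SUFFIX AS client by labelling each view explicitly.
-- Assumption: a matching `summary_pages_mobile` view exists alongside the desktop one.
SELECT
  'desktop' AS client,
  col1
FROM
  `httparchive.latest.summary_pages_desktop`
UNION ALL
SELECT
  'mobile' AS client,
  col1
FROM
  `httparchive.latest.summary_pages_mobile`
```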
Oh yeah, I keep forgetting this! We'd still need to redefine the latest views each month, as part of the batch after the data is loaded, to look at the latest date. I tried doing this, but it didn't work:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
client = 'desktop' AND
date = (SELECT max(date) from `httparchive.all.pages`)
);
SELECT client FROM `httparchive.scratchspace.test_latest_desktop2`;
The last
Whereas hard coding the date ( |
The Alternatively, users can
Try something like
SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'
But I think there's a better way using
SELECT
MAX(partition_id)
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__' |
Don't think that would work as need
Same error:
I'm not sure how to use this info to create the view though? |
This validates:
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date IS NOT NULL AND
date = (
SELECT
CAST(MAX(partition_id) AS DATE) AS date
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__')
AND client = 'desktop'
Not tested |
Close! This works:
CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
SELECT
date,
client,
page,
rank,
payload
FROM
`httparchive.all.pages`
WHERE
date IS NOT NULL AND
date = (
SELECT
CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
FROM
`httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
table_name = 'pages' AND
partition_id != '__NULL__')
AND client = 'desktop'
);
select client from `httparchive.scratchspace.test_latest_desktop2`;
Only 322 MB processed! |
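The view above turns a `YYYYMMDD` partition ID into a DATE by rewriting the string with REGEXP_REPLACE and then casting it. A possibly simpler alternative, not tested against this dataset, is BigQuery's PARSE_DATE with an explicit format string. The extra `__UNPARTITIONED__` exclusion is an assumption added as a precaution in case streamed rows ever reappear; it was not part of the queries in this thread.

```sql
-- Sketch only: resolves the latest partition date for the pages table.
-- Assumption: partition IDs are daily and formatted as YYYYMMDD.
SELECT
  PARSE_DATE('%Y%m%d', MAX(partition_id)) AS date
FROM
  `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'pages' AND
  -- '__NULL__' is excluded as in the thread; '__UNPARTITIONED__' is an added precaution.
  partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')
```

This subquery could stand in for the `CAST(REGEXP_REPLACE(...))` expression inside the view definition, since both should yield the same DATE value for well-formed partition IDs.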
Nit/tip: use
So it looks like clustering does affect performance of the views, which is great. In that case I don't see a reason to continue distinguishing between desktop/mobile in the |
Well that removes the wildcard issue! Though it is a breaking change. But does anyone even use the |
Yeah I filed this issue in response to a DM from a Googler trying to use one of the |
Hi Rick and Barry, is there an update on enabling the .latest table view? |
@romaincurutchet not yet sorry. Which views are you querying? We can create a new experimental view to unblock you and verify that the proposed approach works. |
I have created three new views in the meantime:
These will automatically point to the latest month of data. These tables are slightly different to the current tables (as well as not being split by desktop and mobile - use the
We're still figuring out the final schema, so this is subject to change, but hopefully that unblocks you for now @romaincurutchet |
Thank you both! |
I just checked that the views give pretty good estimates for queries with cluster filters. |
We have scheduled queries that generate the tables in the httparchive.latest dataset. These tables currently have no content, due to an unknown bug. Can we leverage views and the new partitioned all dataset to make this process more streamlined and maintenance-free?