Convert the `latest` dataset to views #141

rviscomi · 2022-09-21T18:32:36Z

We have scheduled queries that generate the tables in the httparchive.latest dataset. These tables currently have no content, due to an unknown bug. Can we leverage views and the new partitioned all dataset to make this process more streamlined and maintenance-free?


tunetheweb · 2022-09-21T20:56:40Z

As investigated in #142 this should be possible, with the exception that they cannot be used in wildcard queries.

My main concerns would be:

The wildcard issue
The all dataset is streaming, so we would need to update the views after that's finished to avoid people querying half datasets.

One possibility is to create a httparchive.all.latest_date table with one date (e.g. 2022-09-01) and then create the view using something like:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
  SELECT
    date,
    client
    -- ... (remaining columns)
  FROM
    `httparchive.all.pages`
  JOIN
    `httparchive.all.latest_date`
  USING (date)
  WHERE
    client = 'desktop'
);

And then could just update httparchive.all.latest_date at the end of each month's run to signal that data is now ready, and all the views referencing that would automatically switch to the new date.
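As a rough sketch of that monthly signal, assuming httparchive.all.latest_date is a one-row table with a single DATE column named date (the table is only proposed above, so its name and shape are assumptions), the last step of the monthly run could be something like:

```sql
-- Hypothetical final step of the monthly run: publish the newly loaded crawl date.
-- Replacing the whole table keeps it at exactly one row, so views that
-- JOIN ... USING (date) against it switch to the new month in one step.
CREATE OR REPLACE TABLE `httparchive.all.latest_date` AS
SELECT DATE '2022-09-01' AS date;
```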

I ran this, and it seemed to work:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_date` AS (
  SELECT CAST('2022-08-01' AS DATE) AS date
);

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  JOIN
    `httparchive.scratchspace.test_latest_date`
  USING (date)
  WHERE
    client = 'desktop'
);

and I get these costs:

SELECT page FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.13 GB
SELECT page, rank FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.27 GB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop`; -- 4.79 TB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop` WHERE rank = 1000; -- 532.58 MB

rviscomi · 2022-09-22T20:15:58Z

> My main concerns would be:
>
> The wildcard issue
>
> The all dataset is streaming, so would need to update the views after that's finished to avoid people querying half datasets.

Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.

Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.

tunetheweb · 2022-09-22T20:59:43Z

> Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.

No, the wildcard issue only applies if httparchive.latest.summary_pages_desktop and httparchive.latest.summary_pages_mobile are views on the all dataset and you want to run this:

SELECT
  _TABLE_SUFFIX AS client,
  col1
FROM
  `httparchive.latest.summary_pages_*`

That will not work, as it's using a wildcard on two views.

But if you did this to query one table, it would work fine:

SELECT
  col1
FROM
  `httparchive.latest.summary_pages_desktop`

> Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.

Oh yeah keep forgetting this!

We'd still need to redefine the latest views each month as part of the batch after the data is loaded to look at the latest date.

I tried doing this, but it didn't work:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    client = 'desktop' AND
    date = (SELECT max(date) from `httparchive.all.pages`)
);

SELECT client FROM `httparchive.scratchspace.test_latest_desktop2`;

The last SELECT complains of a missing required date param.

Whereas hard-coding the date (AND date = '2022-08-01') in the view definition, or joining to a latest_date table, allows the last SELECT to run fine.

rviscomi · 2022-09-22T21:45:09Z

The all tables are clustered by client, so is there any reason not to have a single view for each legacy table type, for example latest.summary_pages? The user could filter to a client with WHERE client = 'desktop' if needed. Based on your other experiment, it seems like it should have the same performance gains as applying the filter on the clustered table itself.

Alternatively, users can UNION ALL both desktop/mobile views together if needed.

> The last SELECT complains of a missing required date param.

Try something like

SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'

But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:

SELECT
  MAX(partition_id)
FROM
  `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'pages' AND
  partition_id != '__NULL__'

tunetheweb · 2022-09-22T22:00:38Z

> Alternatively, users can UNION ALL both desktop/mobile views together if needed.

Don't think that would work as need _TABLE_SUFFIX to get the client. Could add client column to latest views, but seems a little redundant and still would require the changes (change to UNION ALL and use client column). Not the worst, but a change from what's there currently...
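A minimal sketch of what that change might look like for users, assuming the two per-client views exist and share a column (col1 is a placeholder, as in the examples above). Since views do not expose _TABLE_SUFFIX, the client is supplied here as a literal column rather than read from the view:

```sql
-- Replacement for a wildcard query over summary_pages_*:
-- each branch tags its rows with the client it came from.
SELECT 'desktop' AS client, col1
FROM `httparchive.latest.summary_pages_desktop`
UNION ALL
SELECT 'mobile' AS client, col1
FROM `httparchive.latest.summary_pages_mobile`;
```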

> Try something like
>
> SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'

Same error:

Query error: Cannot query over table 'httparchive.all.pages' without a filter over column(s) 'date' that can be used for partition elimination at [16:1]

> But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:

I'm not sure how to use this info to create the view though?

rviscomi · 2022-09-22T22:11:23Z

This validates:

  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(MAX(partition_id) AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__')
  AND client = 'desktop'

date IS NOT NULL seems to satisfy the validator

Not tested

tunetheweb · 2022-09-22T22:29:43Z

Close! This works:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__')
  AND client = 'desktop'
);

select client from `httparchive.scratchspace.test_latest_desktop2`;

Only 322MB processed!

tunetheweb · 2022-09-22T22:31:33Z

And adding rank is even quicker:

select date from `httparchive.scratchspace.test_latest_desktop2` where rank = 1000;

rviscomi · 2022-09-23T00:03:14Z

> REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3')

Nit/tip: use a raw string (r'') for the replacement too, to avoid escaping the capture groups: r'\1-\2-\3'
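For illustration, both expressions below should yield the same date from a partition_id such as '20220801' (a sample value chosen to match the 2022-08-01 crawl; the real IDs come from INFORMATION_SCHEMA.PARTITIONS). PARSE_DATE is shown only as a possible regex-free alternative, not something proposed in this thread:

```sql
SELECT
  -- Raw string for the replacement, so the capture groups need no double escaping.
  CAST(REGEXP_REPLACE('20220801', r'(\d{4})(\d{2})(\d{2})', r'\1-\2-\3') AS DATE) AS via_regex,
  -- Alternative: parse the YYYYMMDD string directly.
  PARSE_DATE('%Y%m%d', '20220801') AS via_parse_date;
```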

> And adding rank is even quicker:

So it looks like clustering does affect performance of the views, which is great. In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.

tunetheweb · 2022-09-23T00:22:03Z

> In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.

Well that removes the wildcard issue!

Though it is a breaking change. But does anyone even use the latest tables? I never do as prefer to be explicit. Guess we’ll find out when we make this change…

rviscomi · 2022-09-23T00:24:40Z

Yeah I filed this issue in response to a DM from a Googler trying to use one of the latest tables. We could have some sort of deprecation period when we support old and new versions of the views, and announce the timeline on the forum/social.

romaincurutchet · 2022-10-21T16:34:49Z

Hi Rick and Barry, is there an update on enabling the .latest table view?
Thank you.

rviscomi · 2022-10-21T16:42:19Z

@romaincurutchet not yet sorry. Which views are you querying? We can create a new experimental view to unblock you and verify that the proposed approach works.

tunetheweb · 2022-10-21T18:48:26Z

I have created three new views in the meantime:

httparchive.latest.pages
httparchive.latest.requests
httparchive.latest.lighthouse

These will automatically point to the latest month of data.

These tables are slightly different to the current tables (as well as not being split by desktop and mobile - use the client column to restrict those) but hopefully should be easy to convert your queries to these new ones. Note there are some extra columns in these, which might make them easier (and cheaper and quicker!) to query if you use them.
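As a sketch of what a converted query might look like, assuming the new view keeps the client, page, and rank columns used in the test views earlier in this thread (the final schema may differ):

```sql
-- One view for both clients: filter on the client column
-- instead of picking a _desktop or _mobile table.
SELECT
  page
FROM
  `httparchive.latest.pages`
WHERE
  client = 'desktop' AND
  rank = 1000;
```

Filtering on client and rank should also keep the bytes processed low, in line with the clustering results reported above.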

We're still figuring out the final schema, so this is subject to change, but hopefully that unblocks you for now @romaincurutchet

romaincurutchet · 2022-10-25T20:04:25Z

Thank you both!

max-ostapenko · 2024-10-18T22:22:55Z

I just checked that the views give a pretty good cost estimate for queries with cluster filters.
So the current solution is the best. Closing.

rviscomi added the enhancement New feature or request label Sep 21, 2022

rviscomi added this to the M3: Improving the analysis pipeline milestone Sep 21, 2022

rviscomi assigned rviscomi and paulcalvano Sep 21, 2022

rviscomi assigned tunetheweb Sep 21, 2022

This was referenced Oct 21, 2022

Investigate views on all dataset #142

Open

Create auto updating sample_data queries #150

Closed

max-ostapenko mentioned this issue Sep 9, 2024

Update the "latest" tables from Dataflow HTTPArchive/bigquery#81

Closed

max-ostapenko closed this as completed Oct 18, 2024
