Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert the latest dataset to views #141

Closed
rviscomi opened this issue Sep 21, 2022 · 16 comments
Closed

Convert the latest dataset to views #141

rviscomi opened this issue Sep 21, 2022 · 16 comments
Assignees
Labels
enhancement New feature or request

Comments

@rviscomi
Copy link
Member

We have scheduled queries that generate the tables in the httparchive.latest dataset. These tables currently have no content, due to an unknown bug. Can we leverage views and the new partitioned all dataset to make this process more streamlined and maintenance-free?

@tunetheweb
Copy link
Member

As investigated in #142 this should be possible, with the exception that they cannot be used in wildcard queries.

My main concerns would be:

  • The wildcard issue
  • The all datastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.

One possibility is to create a httparchive.all.latest_date table with one date (e.g. 2022-09-01) and then create the view using something like:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
  SELECT
    date,
    client
    ....
  FROM
    `httparchive.all.pages`
  JOIN
    `httparchive.all.latest_date`
  USING (date)
  WHERE
    client = 'desktop'
);

And then could just update httparchive.all.latest_date at the end of each month's run to signal that data is now ready, and all the views referencing that would automatically switch to the new date.

I ran this, and it seemed to work:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_date` AS (
  SELECT CAST('2022-08-01' AS DATE) AS date
)

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  JOIN
    `httparchive.scratchspace.test_latest_date`
  USING (date)
  WHERE
    client = 'desktop'
);

and I get these costs:

SELECT page FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.13GB
SELECT page, rank FROM `httparchive.scratchspace.test_latest_desktop`; -- 1.27GB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop`; -- 4.79 TB
SELECT page, rank, payload FROM `httparchive.scratchspace.test_latest_desktop` WHERE rank = 1000; -- 532.58MB

@rviscomi
Copy link
Member Author

My main concerns would be:

  • The wildcard issue
  • The all datastream is streaming, so would need to update the views after that's finished to avoid people querying half datasets.

Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.

Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.

@tunetheweb
Copy link
Member

Could you clarify if the wildcard issue applies to the all dataset? Unlike the dated tables like pages.2022_09_01 that need wildcards like pages.* to process multiple dates, the all.pages table is partitioned by date.

No, the wildcard issue only applies if And httparchive.latest.summary_pages_desktop and httparchive.latest.summary_pages_mobile are views on the all dataset and you want to run this:

SELECT
  _TABLE_SUFFIX AS client,
  col1
FROM
  `httparchive.latest.summary_pages_*`

That will not work, as it's using a wildcard on two views.

But if you did this to query one table, it would work fine:

SELECT
  col1
FROM
  ``httparchive.latest.summary_pages_desktop

Also, we've recently switched from streaming to batch inserts due to maintenance and data quality complexities, so partial datasets are no longer a concern.

Oh yeah keep forgetting this!

We'd still need to redefine the latest views each month as part of the batch after the data is loaded to look at the latest date.

I tried doing this, but it didn't work:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    client = 'desktop' AND
    date = (SELECT max(date) from `httparchive.all.pages`)
);

SELECT client FROM `httparchive.scratchspace.test_latest_desktop2`;

The last SELECT complains of a missing required date param.

Where as hard coding the date (AND date = 2022-08-01') in the view definition, or joining to a latest_date` table, allows the last SELECT to run fine.

@rviscomi
Copy link
Member Author

rviscomi commented Sep 22, 2022

The all tables are clustered by client, so is there any reason not to have a single view for each legacy table type, for example latest.summary_pages? The user could filter to a client with WHERE client = 'desktop' if needed. Based on your other experiment, it seems like it should have the same performance gains as applying the filter on the clustered table itself.

Alternatively, users can UNION ALL both desktop/mobile views together if needed.

The last SELECT complains of a missing required date param.

Try something like

SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'

But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:

SELECT
  MAX(partition_id)
FROM
  `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
WHERE
  table_name = 'pages' AND
  partition_id != '__NULL__'

@tunetheweb
Copy link
Member

Alternatively, users can UNION ALL both desktop/mobile views together if needed.

Don't think that would work as need _TABLE_SUFFIX to get the client. Could add client column to latest views, but seems a little redundant and still would require the changes (change to UNION ALL and use client column). Not the worst, but a change from what's there currently...

Try something like

SELECT max(date) from `httparchive.all.pages` WHERE date > '2000-01-01'

Same error:

Query error: Cannot query over table 'httparchive.all.pages' without a filter over column(s) 'date' that can be used for partition elimination at [16:1]

But I think there's a better way using INFORMATION_SCHEMA.PARTITIONS to query the metadata directly:

I'm not sure how to use this info to create the view though?

@rviscomi
Copy link
Member Author

This validates:

  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(MAX(partition_id) AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__')
  AND client = 'desktop'

date IS NOT NULL seems to satisfy the validator

Not tested

@tunetheweb
Copy link
Member

Close! This works:

CREATE OR REPLACE VIEW `httparchive.scratchspace.test_latest_desktop2` AS (
  SELECT
    date,
    client,
    page,
    rank,
    payload
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__')
  AND client = 'desktop'
);

select client from `httparchive.scratchspace.test_latest_desktop2`;

image

Only 322MB processed!

@tunetheweb
Copy link
Member

And adding rank is even quicker:

select date from `httparchive.scratchspace.test_latest_desktop2` where rank = 1000;

image

@rviscomi
Copy link
Member Author

REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3')

Nit/tip: use r'' to avoid escaping the capture groups r'\1-\2-\3'

And adding rank is even quicker:

So it looks like clustering does affect performance of the views, which is great. In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.

@tunetheweb
Copy link
Member

In that case I don't see a reason to continue distinguishing between desktop/mobile in the latest tables/views.

Well that removes the wildcard issue!

Though it is a breaking change. But does anyone even use the latest tables? I never do as prefer to be explicit. Guess we’ll find out when we make this change…

@rviscomi
Copy link
Member Author

Yeah I filed this issue in response to a DM from a Googler trying to use one of the latest tables. We could have some sort of deprecation period when we support old and new versions of the views, and announce the timeline on the forum/social.

@romaincurutchet
Copy link

Hi Rick and Barry, is there an update on enabling the .latest table view?
Thank you.

@rviscomi
Copy link
Member Author

@romaincurutchet not yet sorry. Which views are you querying? We can create a new experimental view to unblock you and verify that the proposed approach works.

@tunetheweb
Copy link
Member

I have created three new views in the meantime:

  • httparchive.latest.pages
  • httparchive.latest.requests
  • httparchive.latest.lighthouse

These will automatically point to the latest month of data.

These tables are slightly different to the current tables (as well as not being split by desktop and mobile - use the client column to restrict those) but hopefully should be easy to convert your queries to these new ones. Note there are some extra columns in these, which might make them easier (and cheaper and quicker!) to query if you use them.

We're still figuring our the final schema, so this is subject to change, but hopefully that unblocks you for now @romaincurutchet

@romaincurutchet
Copy link

Thank you both!

@max-ostapenko
Copy link
Contributor

I just checked that the views give pretty good estimation for queries with cluster filters.
So the current solution is the best. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants