Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create auto updating sample_data queries #150

Closed
tunetheweb opened this issue Oct 21, 2022 · 5 comments · Fixed by HTTPArchive/dataform#7
Closed

Create auto updating sample_data queries #150

tunetheweb opened this issue Oct 21, 2022 · 5 comments · Fixed by HTTPArchive/dataform#7

Comments

@tunetheweb
Copy link
Member

These could be views on the latest all tables - similar to latest tables in #141 but with a reduced dataset.

The rank=1000 websites might be a good fit here. Less random, but maybe that's a good thing?

@tunetheweb
Copy link
Member Author

tunetheweb commented Oct 21, 2022

For pages this is easy to create as a view:

CREATE OR REPLACE VIEW `httparchive.sample_data.pages_1k` AS (
  SELECT
    *
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__') AND
    rank = 1000
);

For requests it's not as easy as it does not contain the rank column and is not partitioned or clustered by page even if we do join.

@rviscomi what do you think about adding rank column to this table and clustering by it?

@rviscomi
Copy link
Member

Could this view query the corresponding latest view for simplicity?

@rviscomi what do you think about adding rank column to this table and clustering by it?

Is it possible to set clustering on views?

Or are you asking about clustering in the all.pages table? If so, it's already clustered by rank.

image

@tunetheweb
Copy link
Member Author

tunetheweb commented Oct 21, 2022

Could this view query the corresponding latest view for simplicity?

Yup! Done:

CREATE OR REPLACE VIEW `httparchive.sample_data.pages_1k` AS (
  SELECT
    *
  FROM
    `httparchive.latest.pages`
  WHERE
    rank = 1000
);

Seems to work.

@rviscomi what do you think about adding rank column to this table and clustering by it?

Is it possible to set clustering on views?

You don't set clustering on views - you set it on the underlying table. Think of a view as just a shorthand that gets swapped in just before the query runs.

Or are you asking about clustering in the all.pages table? If so, it's already clustered by rank.

I'm asking about adding rank to all.requests and clustering on that. It works fine for pages, just not requests. Even if I use the view to join to pages like this:

CREATE OR REPLACE VIEW `httparchive.sample_data.requests_1k` AS (
  SELECT
    r.*
  FROM
    `httparchive.latest.requests` r
  JOIN
    `httparchive.latest.pages`
  USING (date, client, page)
  WHERE
    rank = 1000
);

It still costs 1 TB to query this sample table - even for a simple SELECT COUNT(0) FROM httparchive.sample_data.requests_1k as it basically needs to generate the result of the above join (hence using all columns) and only then do the COUNT.

If there was a rank on the httparchive.all.requests table I could just use that without a join, and therefore BigQuery would pass on the query to the underlying httparchive.all.requests rather than running it, and so it would act like the pages_1k table.

@rviscomi
Copy link
Member

Oh sorry I misread and thought you were asking about pages. The requests table is already clustered by 4 fields (client, is_root_page, is_main_document, type) and IIUC that's the maximum number of fields to cluster by.

Adding rank to that table is probably a good idea, but we won't be able to cluster it.

@tunetheweb
Copy link
Member Author

Ah interesting. Then I don’t think we can have sample_data.requests_1k as a view. At least not a very responsive one. So probably need to make it a scheduled query to update (and might as well do sample_data.pages at same time for consistency). Not the worst, as no urgency on that so can just do near end of the month.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants