Create auto updating `sample_data` queries #150

tunetheweb · 2022-10-21T20:59:21Z

These could be views on the latest data in the `all` tables, similar to the latest tables in #141 but with a reduced dataset.

The rank=1000 websites might be a good fit here. Less random, but maybe that's a good thing?

tunetheweb · 2022-10-21T21:19:34Z

For pages this is easy to create as a view:

CREATE OR REPLACE VIEW `httparchive.sample_data.pages_1k` AS (
  SELECT
    *
  FROM
    `httparchive.all.pages`
  WHERE
    date IS NOT NULL AND
    date = (
      SELECT
        CAST(REGEXP_REPLACE(MAX(partition_id), r'(\d{4})(\d{2})(\d{2})', '\\1-\\2-\\3') AS DATE) AS date
      FROM
        `httparchive.all.INFORMATION_SCHEMA.PARTITIONS`
      WHERE
        table_name = 'pages' AND
        partition_id != '__NULL__') AND
    rank = 1000
);

For requests it's not as easy: the table does not contain the rank column, and it is not partitioned or clustered by page, so even joining to pages does not help.

@rviscomi what do you think about adding rank column to this table and clustering by it?

rviscomi · 2022-10-21T21:30:57Z

Could this view query the corresponding latest view for simplicity?

> @rviscomi what do you think about adding rank column to this table and clustering by it?

Is it possible to set clustering on views?

Or are you asking about clustering in the all.pages table? If so, it's already clustered by rank.

tunetheweb · 2022-10-21T21:42:50Z

> Could this view query the corresponding latest view for simplicity?

Yup! Done:

CREATE OR REPLACE VIEW `httparchive.sample_data.pages_1k` AS (
  SELECT
    *
  FROM
    `httparchive.latest.pages`
  WHERE
    rank = 1000
);

Seems to work.
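
One quick way to sanity-check the view is to count rows per client. This is only a sketch: it assumes the `client` column (used in the join further down this thread) is present on the view, and the aliases are illustrative.

```sql
-- Sanity check: how many rank = 1000 pages does the view return per client?
SELECT
  client,
  COUNT(0) AS num_pages
FROM
  `httparchive.sample_data.pages_1k`
GROUP BY
  client
ORDER BY
  client;
```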

> > @rviscomi what do you think about adding rank column to this table and clustering by it?
>
> Is it possible to set clustering on views?

You don't set clustering on views - you set it on the underlying table. Think of a view as just a shorthand that gets swapped in just before the query runs.
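
To illustrate that expansion, here is a rough sketch of how a query against the `pages_1k` view relates to the underlying definition. The selected column and the `'mobile'` filter value are assumptions chosen only for illustration.

```sql
-- Querying the view...
SELECT
  page
FROM
  `httparchive.sample_data.pages_1k`
WHERE
  client = 'mobile';  -- assumed example value

-- ...is planned roughly as if the view body were written inline,
-- so any partitioning or clustering benefit comes from the table underneath.
SELECT
  page
FROM (
  SELECT
    *
  FROM
    `httparchive.latest.pages`
  WHERE
    rank = 1000
)
WHERE
  client = 'mobile';
```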

> Or are you asking about clustering in the all.pages table? If so, it's already clustered by rank.

I'm asking about adding rank to all.requests and clustering on that. It works fine for pages, just not requests. Even if I use the view to join to pages like this:

CREATE OR REPLACE VIEW `httparchive.sample_data.requests_1k` AS (
  SELECT
    r.*
  FROM
    `httparchive.latest.requests` r
  JOIN
    `httparchive.latest.pages`
  USING (date, client, page)
  WHERE
    rank = 1000
);

It still costs 1 TB to query this sample table, even for a simple `SELECT COUNT(0) FROM httparchive.sample_data.requests_1k`, as it basically needs to generate the result of the above join (hence using all columns) and only then do the COUNT.

If there was a rank column on the httparchive.all.requests table I could just filter on that without a join. BigQuery would then pass the query through to the underlying httparchive.all.requests table rather than computing the join first, and so the view would act like the pages_1k one.
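
For comparison, a sketch of what the view could look like in that case. This is hypothetical: it assumes a `rank` column has been added to the requests table and is exposed through `httparchive.latest.requests`, which is not the case at the time of this comment.

```sql
-- Hypothetical: only valid if `rank` exists on the requests table.
CREATE OR REPLACE VIEW `httparchive.sample_data.requests_1k` AS (
  SELECT
    *
  FROM
    `httparchive.latest.requests`
  WHERE
    rank = 1000  -- assumed column, not present on requests today
);
```

With no join in the view body, it would have the same shape as `pages_1k`.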

rviscomi · 2022-10-21T21:55:01Z

Oh sorry I misread and thought you were asking about pages. The requests table is already clustered by 4 fields (client, is_root_page, is_main_document, type) and IIUC that's the maximum number of fields to cluster by.

Adding rank to that table is probably a good idea, but we won't be able to cluster it.

tunetheweb · 2022-10-21T22:00:03Z

Ah interesting. Then I don’t think we can have sample_data.requests_1k as a view. At least not a very responsive one. So we probably need to make it a scheduled query that updates a table (and might as well do sample_data.pages at the same time for consistency). Not the worst, as there's no urgency on that, so it can just run near the end of the month.
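
A minimal sketch of what such a scheduled query might run, assuming it simply materializes the join from the earlier view into a table. The schedule itself would be configured outside the SQL, and the `p` alias is added here only for readability.

```sql
-- Sketch: rebuild the sample table from the latest crawl.
-- Pays the cost of the join once per run instead of on every query.
CREATE OR REPLACE TABLE `httparchive.sample_data.requests_1k` AS (
  SELECT
    r.*
  FROM
    `httparchive.latest.requests` r
  JOIN
    `httparchive.latest.pages` p
  USING (date, client, page)
  WHERE
    p.rank = 1000
);
```

Queries against the resulting table would then scan only the reduced dataset rather than recomputing the join.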

max-ostapenko mentioned this issue Sep 9, 2024

Sample data tables HTTPArchive/dataform#7

Merged

max-ostapenko closed this as completed in HTTPArchive/dataform#7 Sep 11, 2024
