`REFRESH` docs #27521

ggevay · 2024-06-07T17:13:18Z

As discussed with @morsapaes, this is a draft of the docs for REFRESH materialized views. It's based on https://www.notion.so/materialize/REFRESH-user-docs-draft-4a8f30b737a94619ac9f645abc9f84ce

I added a separate page under "Common patterns", as discussed here. However, I couldn't figure out how to make a link to it appear under that menu item. @morsapaes, could you please help with that?

I haven't yet updated create-materialized-view.md. I'd like to do that after #27325 is merged.

Motivation

This PR adds docs.

Tips for reviewer

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:

morsapaes · 2024-06-20T05:47:34Z

@ggevay: in the end, the content of the pattern page was reference documentation, so I split the base you provided across multiple existing pages instead (CREATE MATERIALIZED VIEW, CREATE CLUSTER, ALTER CLUSTER). Something I'd still like to include but left out is a more understandable description of different querying scenarios (assuming users are not familiar with the rehydration process or transaction isolation levels). Your original draft read:

The REHYDRATION TIME ESTIMATE (of which the default is 0) controls how much earlier to automatically turn on the cluster before a refresh time. This is to allow the cluster to complete rehydration already before the refresh time, so that the refresh can be performed (almost) instantaneously. This way, we avoid unavailability of the MV around refreshes: If the rehydration completes before the refresh time, then querying the MV during the rehydration will simply yield the pre-refresh contents of the MV. On the other hand, if the rehydration doesn't complete before the refresh time, then there will be a period where the MV is mostly not queryable, because queries would need to serve the post-refresh MV contents, but we haven't finished computing it yet. (It's actually still queryable in SERIALIZABLE transaction isolation mode on its own, because in that case it will serve the pre-refresh contents. However, it's not queryable even in SERIALIZABLE mode if you query it together with other objects, e.g., in a join or a union, because other objects' contents usually won't be available at a time that is before the refresh, unless RETAIN HISTORY ... is specified on them.)

Can you or @sthm articulate this in a clearer way? We should document what happens when:

1. Users try to query¹ a materialized view outside the refresh time, when the scheduled cluster is turned off.
2. Users try to query¹ a materialized view outside the refresh time, when the scheduled cluster is hydrating ahead of the refresh time (i.e. for a cluster configured with REHYDRATION TIME ESTIMATE).
3. Users try to query¹ a materialized view during the refresh time, when the cluster has just been turned on.
4. Users query¹ a materialized view after the refresh is done.

Also noting that I left out this bit, though I think it's important and will work it in as a follow-up:

The specified refresh times are exact logical times: even if a refresh
physically completes a few seconds (or more) later than the specified time, the
results will be consistent with the state of the inputs as they were exactly at
the specified logical time.
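
The point about exact logical times can be illustrated with a minimal Python sketch. This is not Materialize code: the `updates` change log and the `contents_as_of` helper are hypothetical, and the sketch assumes that the contents at a logical time accumulate every update timestamped at or before that time.

```python
from datetime import datetime

# Hypothetical change log of one input: (logical timestamp, key, diff) entries.
updates = [
    (datetime(2024, 6, 19, 23, 59, 50), "orders", +1),
    (datetime(2024, 6, 20, 0, 0, 0), "orders", +1),
    (datetime(2024, 6, 20, 0, 0, 5), "orders", +1),  # timestamped after the refresh time
]

def contents_as_of(updates, logical_time):
    """Accumulate all updates whose timestamp is <= logical_time."""
    counts = {}
    for ts, key, diff in updates:
        if ts <= logical_time:
            counts[key] = counts.get(key, 0) + diff
    return counts

refresh_time = datetime(2024, 6, 20, 0, 0, 0)
# Whenever this is physically evaluated (seconds or minutes after midnight),
# the result only reflects the inputs as of the logical refresh time.
print(contents_as_of(updates, refresh_time))
```

The update timestamped five seconds after midnight is excluded no matter when the computation physically runs, which is the consistency property described above.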

Eventually, we should add a SQL pattern that has a practical walkthrough of a real example, and describes the logic of partitioning the data and all the hard bits (in line with this musing from @chuck-alt-delete).

¹ Simple queries, but also queries that join the scheduled materialized views with objects in non-scheduled clusters.

chuck-alt-delete · 2024-06-20T17:43:47Z

Would it be possible to include a very simple example of a hot/warm split? A full "common patterns" guide would go into more detail, but it would be nice if the examples section of the reference documentation had a very simple implementation just to remind the user of how we'd like them to use this feature.

morsapaes · 2024-06-21T10:30:44Z

but it would be nice if the examples section of the reference documentation had a very simple implementation just to remind the user of how we'd like them to use this feature.

I don't have more time to work on this, and think any practical example (simple as it might be) should live under SQL Patterns, not reference documentation.

sthm · 2024-06-25T11:41:17Z

Can you or @sthm articulate this in a clearer way? We should document what happens when:

1. Users try to query¹ a materialized view outside the refresh time, when the scheduled cluster is turned off.
2. Users try to query¹ a materialized view outside the refresh time, when the scheduled cluster is hydrating ahead of the refresh time (i.e. for a cluster configured with REHYDRATION TIME ESTIMATE).
3. Users try to query¹ a materialized view during the refresh time, when the cluster has just been turned on.
4. Users query¹ a materialized view after the refresh is done.

This explains what happens, and why (and when) all these options are required. It's much longer than I'd like it to be, but it's quite nuanced and I don't know how to shorten it without making it harder to understand:

For some use cases it makes sense to trade off data freshness against cost. For instance, it can make sense to keep results on the most recent data (say, the last week) as fresh as possible, so that changes to the inputs are reflected in the outputs as quickly as possible. But once the data is older than a week, it's tolerable for changes in the inputs to take up to 24 hours until they are reflected in the outputs.

This pattern can be realized by creating one materialized view with an ON COMMIT refresh strategy for the data of the last week, and a second materialized view with a REFRESH EVERY '1 day' refresh strategy for data older than a week. End users can then query the union of these two materialized views to get the entire result set, even though its two parts are refreshed with different strategies.
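
As a rough illustration of why the union covers the entire result set, here is a minimal Python sketch of the split. It is only a model, not Materialize SQL: `split_hot_cold`, the `rows` sample data, and the `ts` field are hypothetical names. The point is that the two predicates are complementary, so each row belongs to exactly one of the two views.

```python
from datetime import datetime, timedelta

def split_hot_cold(rows, now, hot_span=timedelta(days=7)):
    """Partition rows by age with complementary predicates, so every row
    lands in exactly one of the two parts (hypothetical helper)."""
    hot = [r for r in rows if now - r["ts"] <= hot_span]   # kept fresh (ON COMMIT view)
    cold = [r for r in rows if now - r["ts"] > hot_span]   # refreshed daily (REFRESH EVERY view)
    return hot, cold

now = datetime(2024, 6, 20, 12, 0)
rows = [
    {"id": 1, "ts": datetime(2024, 6, 19, 8, 0)},   # 1 day old
    {"id": 2, "ts": datetime(2024, 6, 13, 12, 0)},  # exactly 7 days old
    {"id": 3, "ts": datetime(2024, 5, 1, 0, 0)},    # older than a week
]
hot, cold = split_hot_cold(rows, now)
print([r["id"] for r in hot], [r["id"] for r in cold])
```

Because a row moves from the hot part to the cold part as it ages, the union stays complete without duplicates as long as both parts use the same cutoff.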

When a query combines a materialized view that uses a REFRESH EVERY refresh strategy with data from other objects, there are some nuances as to when the query can return results. Assuming the cluster is healthy, this depends on the refresh schedule of the materialized view, i.e. on when the previous refresh happened and when the next one is due: queries can only return results between the last refresh of the materialized view and the time the next refresh is scheduled.

For instance, the following view refreshes once a day at midnight UTC.

CREATE MATERIALIZED VIEW mv_refresh_every
WITH (
  -- Refresh once up front (here on Jun 17), so the view is populated
  -- ahead of the first EVERY refresh on Jun 18
  REFRESH AT '2024-06-17 00:00:00',
  -- Refresh every day at midnight UTC
  REFRESH EVERY '1 day' ALIGNED TO '2024-04-17 00:00:00'
)
...

Assuming the last refresh happened on Jun 19 at midnight, queries will return results from Jun 19 midnight until Jun 20 midnight, even if the cluster maintaining the view is turned off. Starting at Jun 20 midnight, queries will hang until the next refresh has completed.
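
The window in which queries can return can be sketched as a small calculation. This is a minimal Python model rather than Materialize code: `refresh_window` is a hypothetical helper, and it assumes that a REFRESH EVERY ... ALIGNED TO schedule places refreshes at `aligned_to + k * every` for integer `k`.

```python
from datetime import datetime, timedelta

def refresh_window(aligned_to, every, now):
    """Return (last_due, next_due): the scheduled refresh times around `now`.

    Hypothetical helper: models REFRESH EVERY <every> ALIGNED TO <aligned_to>
    as the times aligned_to + k * every for integer k.
    """
    k = (now - aligned_to) // every  # timedelta // timedelta floors to an int
    last_due = aligned_to + k * every
    return last_due, last_due + every

aligned_to = datetime(2024, 4, 17, 0, 0, 0)
every = timedelta(days=1)
last_due, next_due = refresh_window(aligned_to, every, datetime(2024, 6, 19, 15, 30))
print(last_due, next_due)
```

In this model, once the refresh due at `last_due` has completed, queries can return until `next_due`; from `next_due` on they wait for the next refresh to complete.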

If the cluster is running continuously, the refresh happens promptly at midnight, minimizing the time that queries hang. However, the whole purpose of refresh strategies is to remove resources from clusters in between refreshes. To avoid hanging queries when the cluster is turned off, the cluster can be configured to automatically add resources at the next scheduled refresh of any of its materialized views.

ALTER CLUSTER my_refresh_cluster
SET (SCHEDULE = ON REFRESH);

Note, however, that a considerable amount of time may pass between the moment the refresh starts and the moment it actually completes. Let's assume that, in the above example, it takes 23 minutes to complete the refresh of the materialized view mv_refresh_every. If the cluster is configured as shown above, resources will be automatically added around midnight and the refresh will complete roughly 23 minutes later. This means, however, that queries will be stuck between midnight and 0:23, because queries block from the time the refresh was due until it has actually completed.

To avoid hanging queries as much as possible, the cluster can be configured to start the refresh before it is actually due.

ALTER CLUSTER my_refresh_cluster
SET (SCHEDULE = ON REFRESH (REHYDRATION TIME ESTIMATE = '30 min'));

With the preceding configuration, resources will already be added 30 minutes before midnight. This way, the bulk of the required work (the so-called hydration of the materialized view) can be done before midnight, while queries can still return. Queries may then hang only for a brief moment right after midnight, until the actual refresh has completed.
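
The effect of the estimate can be sketched numerically. The following Python model is an illustration under stated assumptions, not a description of how Materialize schedules work internally: `refresh_timeline` is a hypothetical helper, the hydration time is taken as a fixed 23 minutes, and the final refresh step after hydration is treated as instantaneous.

```python
from datetime import datetime, timedelta

def refresh_timeline(refresh_due, hydration_time, estimate=timedelta(0)):
    """Hypothetical model of SCHEDULE = ON REFRESH (REHYDRATION TIME ESTIMATE = estimate).

    Returns (cluster_on, queries_resume, hang): when resources are added,
    when queries can return again, and how long they hang after refresh_due.
    Simplification: the refresh itself is instantaneous once hydration is done.
    """
    cluster_on = refresh_due - estimate
    hydrated = cluster_on + hydration_time
    # Post-refresh results cannot be served before the refresh is due.
    queries_resume = max(hydrated, refresh_due)
    return cluster_on, queries_resume, queries_resume - refresh_due

midnight = datetime(2024, 6, 20, 0, 0)
work = timedelta(minutes=23)

for estimate in (timedelta(0), timedelta(minutes=30), timedelta(minutes=10)):
    cluster_on, resume, hang = refresh_timeline(midnight, work, estimate)
    print(f"estimate={estimate}  cluster on at {cluster_on:%H:%M}  hang={hang}")
```

In this model, no estimate leaves the full 23-minute hang, a 30-minute estimate removes it, and an estimate that is too short (10 minutes) leaves the remaining 13 minutes. An estimate somewhat above the observed hydration time therefore seems the safer choice.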

ggevay

I did a few minor fixes:

There was this "ahead of the first scheduled refresh". Since the AT CREATION refresh is also a scheduled refresh, this was not entirely accurate. I've changed this in several places to things like "ahead of the first EVERY refresh".
There was "We recommend always using the REFRESH AT CREATION strategy with REFRESH EVERY"
- I moved this to REFRESH EVERY, because when using REFRESH EVERY is when the user should definitely see it.
- I rephrased it a bit, because the wording was symmetric between REFRESH EVERY and REFRESH AT CREATION, but actually only one of these makes the other recommended.
There was "and any indexes built on these views". I modified this to "and any indexes supporting these views". This is because it's ok to have indexed views that REFRESH materialized views read from. (In fact, https://github.com/MaterializeInc/accounts/issues/3 does have them, for CSE purposes.) (Also note that REFRESH materialized views are typically not indexed on the refresh cluster, and even if they are, it's only in support of other REFRESH materialized views. This is because these clusters are not always on, so these indexes are not good for serving.)
And a few even more minor things.

doc/user/content/sql/system-catalog/mz_catalog.md

ggevay · 2024-07-02T15:02:03Z

One more question: Where should we work in Steffen's text?

doc/user/content/sql/create-materialized-view.md

ggevay · 2024-07-09T11:46:12Z

@morsapaes
If we don't have the time to incorporate Steffen's text at the moment, we could just merge the current version, and then incorporate Steffen's text later.

ggevay · 2024-07-09T13:10:17Z

Thank you very much for all the feedback and improvements! Merging!

ggevay added the A-docs Area: documentation label Jun 7, 2024

ggevay force-pushed the refresh-docs branch from a1d579c to 4bef4a1 Compare June 8, 2024 11:40

ggevay requested a review from morsapaes June 8, 2024 11:41

ggevay mentioned this pull request Jun 17, 2024

REFRESH options #26010

Open

morsapaes force-pushed the refresh-docs branch from 2e0a954 to b6acc69 Compare June 20, 2024 05:32

morsapaes requested a review from sthm June 20, 2024 05:33

morsapaes marked this pull request as ready for review June 20, 2024 05:51

morsapaes approved these changes Jun 21, 2024

ggevay and others added 2 commits July 2, 2024 15:44

Add REFRESH docs

55a105f

Break page down into reference documentation

a485189

ggevay force-pushed the refresh-docs branch from eb2ea98 to a6e7ab6 Compare July 2, 2024 15:00

ggevay force-pushed the refresh-docs branch from a6e7ab6 to 94d892f Compare July 2, 2024 15:50

docs: Minor fixes around REFRESH

ee083e0

ggevay force-pushed the refresh-docs branch from 94d892f to ee083e0 Compare July 3, 2024 08:31

Clarify scenarios

e266d50

ggevay enabled auto-merge July 9, 2024 13:10

ggevay merged commit 1e4a041 into MaterializeInc:main Jul 9, 2024
11 checks passed

materialize-bot mentioned this pull request Jul 11, 2024

release: v0.108.0 required reviews #28188

Closed

9 tasks
