REFRESH docs #27521

Merged
merged 4 commits into from
Jul 9, 2024

Conversation

ggevay
Contributor

@ggevay ggevay commented Jun 7, 2024

As discussed with @morsapaes, this is a draft of the docs for REFRESH materialized views. It's based on https://www.notion.so/materialize/REFRESH-user-docs-draft-4a8f30b737a94619ac9f645abc9f84ce

I added a separate page under "Common patterns", as discussed here. However, I couldn't figure out how to actually make a link appear under that menu item. @morsapaes, could you please help with that?

I haven't yet updated create-materialized-view.md. I'd like to do that after #27325 is merged.

Motivation

  • This PR adds docs.


@morsapaes
Contributor

morsapaes commented Jun 20, 2024

@ggevay: in the end, the content of the pattern page was reference documentation, so I split the base you provided across multiple existing pages instead (CREATE MATERIALIZED VIEW, CREATE CLUSTER, ALTER CLUSTER). Something I'd still like to include but left out is a more understandable description of different querying scenarios (assuming users are not familiar with the rehydration process or transaction isolation levels). Your original draft read:

The REHYDRATION TIME ESTIMATE (of which the default is 0) controls how much earlier to automatically turn on the cluster before a refresh time. This is to allow the cluster to complete rehydration already before the refresh time, so that the refresh can be performed (almost) instantaneously. This way, we avoid unavailability of the MV around refreshes: If the rehydration completes before the refresh time, then querying the MV during the rehydration will simply yield the pre-refresh contents of the MV. On the other hand, if the rehydration doesn't complete before the refresh time, then there will be a period where the MV is mostly not queryable, because queries would need to serve the post-refresh MV contents, but we haven't finished computing it yet. (It's actually still queryable in SERIALIZABLE transaction isolation mode on its own, because in that case it will serve the pre-refresh contents. However, it's not queryable even in SERIALIZABLE mode if you query it together with other objects, e.g., in a join or a union, because other objects' contents usually won't be available at a time that is before the refresh, unless RETAIN HISTORY ... is specified on them.)

Can you or @sthm articulate this in a clearer way? We should document what happens when:

1. Users try to query [1] a materialized view outside the refresh time, when the scheduled cluster is turned off.
2. Users try to query [1] a materialized view outside the refresh time, when the scheduled cluster is hydrating ahead of the refresh time (i.e. for a cluster configured with REHYDRATION TIME ESTIMATE).
3. Users try to query [1] a materialized view during the refresh time, when the cluster has just been turned on.
4. Users query [1] a materialized view after the refresh is done.

Also noting that I left out this bit, though I think it's important and will work it in as a follow-up:

The specified refresh times are exact logical times: even if a refresh
physically completes a few seconds (or more) later than the specified time, the
results will be consistent with the state of the inputs as they were exactly at
the specified logical time.

Eventually, we should add a SQL pattern that has a practical walkthrough of a real example, and describes the logic of partitioning the data and all the hard bits (in line with this musing from @chuck-alt-delete).

Footnotes

  1. Simple queries, but also queries that join the scheduled materialized views with objects in non-scheduled clusters.

@morsapaes morsapaes marked this pull request as ready for review June 20, 2024 05:51
@chuck-alt-delete
Contributor

Would it be possible to include a very simple example of a hot/warm split? A full "common patterns" guide would go into more detail, but it would be nice if the examples section of the reference documentation had a very simple implementation just to remind the user of how we'd like them to use this feature.

@morsapaes
Contributor

morsapaes commented Jun 21, 2024

but it would be nice if the examples section of the reference documentation had a very simple implementation just to remind the user of how we'd like them to use this feature.

I don't have more time to work on this, and think any practical example (simple as it might be) should live under SQL Patterns, not reference documentation.

@sthm
Contributor

sthm commented Jun 25, 2024

Can you or @sthm articulate this in a clearer way? We should document what happens when:

1. Users try to query [1] a materialized view outside the refresh time, when the scheduled cluster is turned off.
2. Users try to query [1] a materialized view outside the refresh time, when the scheduled cluster is hydrating ahead of the refresh time (i.e. for a cluster configured with REHYDRATION TIME ESTIMATE).
3. Users try to query [1] a materialized view during the refresh time, when the cluster has just been turned on.
4. Users query [1] a materialized view after the refresh is done.

This explains what happens and why (and when) all these options are required. It's much longer than I want it to be, but it's quite nuanced and I don't know how to shorten it without making it harder to understand:


For some use cases it makes sense to trade off data freshness against cost. For instance, it can make sense to keep results on the most recent data (say, the last week) as fresh as possible, so that changes to the inputs are reflected in the outputs as quickly as possible. But once the data is older than a week, it's tolerable for changes in the inputs to take up to 24 hours until they are reflected in the outputs.

This pattern can be realized by creating a materialized view with an ON COMMIT refresh strategy for the data of the last week, and a second materialized view with a REFRESH EVERY '1 day' refresh strategy for data older than a week. End users can then query the union of these two materialized views to get the entire result set, even though its two parts are refreshed with different strategies.

When queries on materialized views with a REFRESH EVERY refresh strategy combine data from other views, there are some nuances as to when the query can return results. Assuming the cluster is healthy, this depends on the refresh schedule of the materialized view, that is, on when the previous refresh happened and when the next one is due: queries can only return results between the last refresh of the materialized view and the time the next refresh is scheduled.

For instance, the following view refreshes once a day at midnight UTC.

CREATE MATERIALIZED VIEW mv_refresh_every
WITH (
  -- Refresh at creation, so the view is populated ahead of
  -- the first scheduled refresh on Jun 18
  REFRESH AT '2024-06-17 00:00:00',
  -- Refresh every day at midnight UTC
  REFRESH EVERY '1 day' ALIGNED TO '2024-04-17 00:00:00'
)
...

Assuming the last refresh happened on Jun 19 at midnight, queries will return from Jun 19 midnight until Jun 20 midnight, even if the cluster maintaining the view is turned off. Starting at Jun 20 midnight, queries will hang until the next refresh has completed.
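To make the timing rule above concrete, here is a minimal Python sketch of it. This is only a model of the schedule arithmetic, not Materialize code: the function names `refresh_window` and `query_returns` are hypothetical, and the model assumes a healthy cluster and that a refresh is identified by its scheduled (logical) time.

```python
from datetime import datetime, timedelta


def refresh_window(now, aligned_to, every):
    """Return the (previous, next) scheduled refresh times around `now`."""
    # Floor division of two timedeltas yields a whole number of periods.
    periods = (now - aligned_to) // every
    previous = aligned_to + periods * every
    return previous, previous + every


def query_returns(query_time, last_completed_refresh, aligned_to, every):
    """A query returns only if the most recently scheduled refresh has completed."""
    previous, _ = refresh_window(query_time, aligned_to, every)
    return last_completed_refresh >= previous


# Values taken from the example: daily refresh aligned to midnight UTC.
aligned_to = datetime(2024, 4, 17)
every = timedelta(days=1)
last_done = datetime(2024, 6, 19)  # refresh for Jun 19 midnight has completed

# Jun 19, 15:00: still inside the window served by the Jun 19 refresh.
print(query_returns(datetime(2024, 6, 19, 15), last_done, aligned_to, every))

# Jun 20, 00:05: the Jun 20 refresh is due but not yet complete, so this hangs.
print(query_returns(datetime(2024, 6, 20, 0, 5), last_done, aligned_to, every))
```

In this model the first call yields `True` and the second `False`; the second turns `True` once the refresh scheduled for Jun 20 midnight has completed.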

If the cluster is running continuously, the refresh happens promptly at midnight, minimizing the time that queries hang. However, the whole purpose of refresh strategies is to remove resources from clusters in between refreshes. To avoid hanging queries when the cluster is turned off, the cluster can be configured to automatically add resources at the next scheduled refresh of any of its materialized views.

ALTER CLUSTER my_refresh_cluster
SET (SCHEDULE = ON REFRESH);

Note, however, that a considerable amount of time may pass between the start of a refresh and its completion. Let's assume that in the above example it takes 23 minutes to complete the refresh of the materialized view mv_refresh_every. If the cluster is configured as shown above, resources will be automatically added around midnight and the refresh completes roughly 23 minutes later. This means, however, that queries will be stuck between midnight and 0:23, because queries block from the time the refresh was due until it has actually completed.

To avoid hanging queries as much as possible, the cluster can be configured to start the refresh before it is actually due.

ALTER CLUSTER my_refresh_cluster
SET (SCHEDULE = ON REFRESH (REHYDRATION TIME ESTIMATE = '30 min'));

With the preceding configuration, resources will already be added 30 minutes before midnight. In this way, the bulk of the required work (the so-called hydration of the materialized view) can already be done before midnight (and while queries can still return). Only right after midnight, queries may hang for a brief moment until the actual refresh is completed.
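The effect of the time estimate on the blocking window can be sketched with a few lines of Python. This is a simplified model, not Materialize behavior verbatim: `blocked_window` is a hypothetical helper, it assumes the cluster turns on exactly `time_estimate` before the refresh time, and it ignores the brief moment needed to finalize the refresh after hydration.

```python
from datetime import datetime, timedelta


def blocked_window(refresh_time, hydration_duration, time_estimate=timedelta(0)):
    """Return (start, end) of the period during which queries would hang."""
    # The cluster is turned on `time_estimate` ahead of the scheduled refresh.
    cluster_on = refresh_time - time_estimate
    # The refresh cannot complete before its scheduled time, nor before
    # hydration has finished.
    refresh_done = max(refresh_time, cluster_on + hydration_duration)
    return refresh_time, refresh_done


midnight = datetime(2024, 6, 20)
hydration = timedelta(minutes=23)  # the 23 minutes assumed in the example

# Without an estimate: cluster starts at midnight, queries hang until 0:23.
start, end = blocked_window(midnight, hydration)
print(end - start)  # 0:23:00

# With a 30 minute estimate: hydration finishes at 23:53, before the refresh is due.
start, end = blocked_window(midnight, hydration, timedelta(minutes=30))
print(end - start)  # 0:00:00
```

If hydration took longer than the estimate (say 45 minutes against a 30 minute estimate), the same model gives a 15 minute blocking window, which is why the estimate should be at least the expected hydration time.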

Contributor Author

@ggevay ggevay left a comment

I did a few minor fixes:

  • There was this "ahead of the first scheduled refresh". Since the AT CREATION is also a scheduled refresh, this was not entirely accurate. I've changed this in several places to things like "ahead of the first EVERY refresh".
  • There was "We recommend always using the REFRESH AT CREATION strategy with REFRESH EVERY"
    • I moved this to REFRESH EVERY, because when using REFRESH EVERY is when the user should definitely see it.
    • I rephrased it a bit, because the wording was symmetric between REFRESH EVERY and REFRESH AT CREATION, but actually only one of these makes the other recommended.
  • There was "and any indexes built on these views". I modified this to "and any indexes supporting these views". This is because it's ok to have indexed views that REFRESH materialized views read from. (In fact, https://github.com/MaterializeInc/accounts/issues/3 does have them, for CSE purposes.) (Also note that REFRESH materialized views are typically not indexed on the refresh cluster, and even if they are, it's only in support of other REFRESH materialized views. This is because these clusters are not always on, so these indexes are not good for serving.)
  • And a few even more minor things.

@ggevay
Contributor Author

ggevay commented Jul 2, 2024

One more question: Where should we work in Steffen's text?

@ggevay
Contributor Author

ggevay commented Jul 9, 2024

@morsapaes
If we don't have the time to incorporate Steffen's text at the moment, we could just merge the current version, and then incorporate Steffen's text later.

@ggevay
Contributor Author

ggevay commented Jul 9, 2024

Thank you very much for all the feedback and improvements! Merging!

@ggevay ggevay enabled auto-merge July 9, 2024 13:10
@ggevay ggevay merged commit 1e4a041 into MaterializeInc:main Jul 9, 2024
11 checks passed
Labels
A-docs Area: documentation

4 participants