Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow reading a stale materialized view #13484

Closed
wants to merge 1 commit into from
Closed

Conversation

findepi
Copy link
Member

@findepi findepi commented Aug 3, 2022

Fixes #10606

@cla-bot cla-bot bot added the cla-signed label Aug 3, 2022
@findepi
Copy link
Member Author

findepi commented Aug 3, 2022

@@ -98,6 +100,7 @@
private boolean legacyCatalogRoles;
private boolean incrementalHashArrayLoadFactorEnabled = true;
private boolean allowSetViewAuthorization;
private Duration materializedViewRequiredFreshness = new Duration(0, SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that most likely should be a MV property

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should definitely be allowed as an MV property as it is much easier to think about how much staleness is acceptable for specific MVs rather than defining it globally.
In fact, I would lean against making it configurable globally at all as it seems very tricky for an admin to come up with an acceptable staleness value for all MVs. It also creates the possibility of end users unexpectedly receiving stale results without having opted into it explicitly through MV property or session property.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data will likely have different retention policy, therefore it has to be defined per MV

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sopel39 @raunaqmorarka @hashhar
mistakenly replied in a different (related) thread here: #13484 (comment)

Copy link
Member

@anjalinorwood anjalinorwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great enhancement. Thank you for adding it. LGTM.

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests as well

@@ -2025,8 +2025,7 @@ public Optional<ConnectorOutputMetadata> finishRefreshMaterializedView(

String dependencies = sourceTableHandles.stream()
.map(handle -> (IcebergTableHandle) handle)
.filter(handle -> handle.getSnapshotId().isPresent())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about this commit, if tables created by trino always have a snapshot id and CREATE MV cannot be pointed to pre-existing tables, then can this scenario arise in practise ?
Could we require that iceberg table always have a snapshot id here instead ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about this commit, if tables created by trino always have a snapshot id and CREATE MV cannot be pointed to pre-existing tables, then can this scenario arise in practise ?

see 5c1750e#r80166651

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks
can we have a test which creates MV in trino on source iceberg table created by spark ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should (as part of #4832 ...)

@@ -98,6 +100,7 @@
private boolean legacyCatalogRoles;
private boolean incrementalHashArrayLoadFactorEnabled = true;
private boolean allowSetViewAuthorization;
private Duration materializedViewRequiredFreshness = new Duration(0, SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should definitely be allowed as an MV property as it is much easier to think about how much staleness is acceptable for specific MVs rather than defining it globally.
In fact, I would lean against making it configurable globally at all as it seems very tricky for an admin to come up with an acceptable staleness value for all MVs. It also creates the possibility of end users unexpectedly receiving stale results without having opted into it explicitly through MV property or session property.

@@ -1783,8 +1788,8 @@ protected Scope visitTable(Table table, Optional<Scope> scope)

Optional<MaterializedViewDefinition> optionalMaterializedView = metadata.getMaterializedView(session, name);
if (optionalMaterializedView.isPresent()) {
if (metadata.getMaterializedViewFreshness(session, name).isMaterializedViewFresh()) {
// If materialized view is current, answer the query using the storage table
if (isMaterializedViewSufficientlyFresh(name)) {
Copy link
Member

@raunaqmorarka raunaqmorarka Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should let this be an implementation detail of the the connector's metadata.getMaterializedViewFreshness code. Iceberg MV code could figure out the "sufficiently fresh" condition and indicate freshness to the engine through existing boolean flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to differentiate two use-cases

  • staleness for explicit querying
    • i didn't yet grasp why we need this, but Trino MVs accept no staleness today, unlike eg PostgreSQL's or Oracle's. In those systems, the user has control over when the view is refreshed (as in Trino), so it feels OK to query the MV directly if the user asks to, irrespective of the staleness. We chose hybrid approach, but -- for performance and cost control reasons -- users want to query stale MVs directly.
      i think it's fine to have this configurable on per view basis, but I expect we will later need to add session-level controls (overrides) as well.
      The question is how to we model this for per-MV. I can easily make this a connector property, but this is something we will want to define for all connectors, which calls for some unification (syntax?).
      If we agree that we want to have both per-MV and per-session controls, does it matter which one we start with?
  • staleness for query rewrites
    • this one is more tricky. I believe our MVs' "fresh or inline" philosophy was picked with query rewrites as the primary use-case (so departing from apparent experience of other systems). The query rewrites will also want to accept certain staleness, but this use-case is much more delicate, as the logic is implicit. I am sure we will want per-MV and per-session controls for this feature when we build it (but building it is not a goal for this PR)

@Override
public boolean equals(Object obj)
public boolean equals(Object o)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we skip variable rename or make it separate commit ?

sb.append("materializedViewFresh=").append(materializedViewFresh);
sb.append('}');
return sb.toString();
return new StringJoiner(", ", MaterializedViewFreshness.class.getSimpleName() + "[", "]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringBuilder -> StringJoiner change in separate commit ?

@martint
Copy link
Member

martint commented Aug 4, 2022

I’m not sure a global setting for this is appropriate. There’s usually not a one-size-fits-all value that captures the requirements for different use cases.

Also, it’s a departure from the model we’ve been working towards, which is that materialized views have the semantics of a view. I.e., always fresh, either because it’s already up to date, refreshed on the fly, or inlined and computed like a normal view. We also discussed possible syntax extensions to allow a user to indicate how much staleness they are willing to tolerate on a case by case basis.

@findepi
Copy link
Member Author

findepi commented Aug 4, 2022

model we’ve been working towards, which is that materialized views have the semantics of a view. I.e., always fresh, either because it’s already up to date, refreshed on the fly, or inlined and computed like a normal view.

This is the ideal situation from query engine perspective, but it's too strict in practice for end-users.
It's OK for a user to have eg a daily-refreshed view, and accept a day of staleness, to avoid expensive computations.

For example, PostgreSQL materialized view feature https://www.postgresql.org/docs/14/rules-materializedviews.html started off from a different angle: one has explicit refreshes and implicitly accepts any amount of staleness. While our fresh-or-inline approach looks nicer, I don't think PostgreSQL's approach is absurd either. Both have their strengths and weaknesses. BTW PostgreSQL's approach seems to be directionally aligned with Oracle's.

@@ -479,6 +482,20 @@ public FeaturesConfig setAllowSetViewAuthorization(boolean allowSetViewAuthoriza
return this;
}

@NotNull
public Duration getMaterializedViewRequiredFreshness()
Copy link
Member

@sopel39 sopel39 Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very much property of data, not a system-wide property. I also think MV refresh scheduler should be outside of core engine and engine should only be responsible for executing REFRESH or SELECT queries.

I don't think that engine needs to know freshness duration if it cannot do much about if without a scheduler.

High level question is:

  • Do we need resolution as MV as view at all?
    • If no, then the whole concept of freshness doesn't make sense.
    • If yes, then IMO it's the connector that should be responsible for business (time-based) logic whether MV is fresh or not. For example MV freshness might be tightly coupled with refresh interval and MV can be refreshed based on regex (e.g. every day on midnight).

@findepi findepi mentioned this pull request Aug 9, 2022
@findepi findepi self-assigned this Aug 12, 2022
@findepi findepi force-pushed the findepi/read-stale-mv branch from ef0055a to f8d0177 Compare August 24, 2022 12:34
@findepi
Copy link
Member Author

findepi commented Aug 24, 2022

(just rebased)

@findepi
Copy link
Member Author

findepi commented Sep 17, 2022

@romanvainb will follow up sooner or later

@findepi findepi closed this Sep 17, 2022
@findepi findepi deleted the findepi/read-stale-mv branch September 17, 2022 16:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

Allow reading from a materialized view (storage table) when view is stale (not fresh)
5 participants