Dataflow builder issues #8386
Replies: 4 comments 1 reply
-
One of the prominent issues is that the coordinator often comes to […]
-
Copying this Slack message by @asenac over here since it seems relevant to the dataflow building discussion:
[…]
-
I grabbed @mjibson in person on Friday to try to gain a shared understanding of the problem involving the dataflow builder. The following is a write-up of my understanding as of Friday; @mjibson, feel free to correct me if I've gotten something wrong. Coordinator issues #8318 is the overarching issue for dataflow builder issues that affect the coordinator team. It references two issues, both of which have workarounds that have been merged to main:
#8241 specifically refers to Materialize crashing if User1 adds an index while User2 has a read transaction. I asked @mjibson whether we can run into a problem if User1 deletes an index while User2 has a read transaction. He says that, to the best of our knowledge, we will not. But the mechanism preventing problems in this scenario is not in the coordinator; he believes problems are prevented by reference counting in the dataflow layer, or something like that. We should figure out the full answer by the time we redesign the DataflowBuilder.

Design of DataflowBuilder

We talked a bit about how the current DataflowBuilder struct has immutable reference fields and how that makes implementing methods that mutably borrow DataflowBuilder (which is all of them) hard. I pointed out, though, that the amount of cloning done in the optimizer is several orders of magnitude greater than any cloning we would do in the DataflowBuilder to get around those Rust restrictions. After all, the DataflowBuilder would just clone some IDs and likely simple […]
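To make the borrow-checker point concrete, here is a minimal sketch (all type and field names are illustrative stand-ins, not the real Materialize API): a builder that holds immutable references into the catalog fights the borrow checker on every `&mut self` method, whereas a builder that clones the handful of IDs it needs owns all of its data and avoids the problem entirely.

```rust
use std::collections::BTreeMap;

// Stand-ins for real catalog types (hypothetical, for illustration only).
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct GlobalId(u64);

struct Catalog {
    indexes: BTreeMap<GlobalId, String>, // id -> index name
}

// Owned data only: no lifetime parameters, so methods that take
// `&mut self` compose freely with other borrows of the catalog.
struct DataflowBuilder {
    imported_indexes: Vec<GlobalId>,
}

impl DataflowBuilder {
    fn new(catalog: &Catalog) -> Self {
        // Cloning a handful of IDs is cheap compared to the cloning
        // already done in the optimizer.
        DataflowBuilder {
            imported_indexes: catalog.indexes.keys().copied().collect(),
        }
    }

    fn import_index(&mut self, id: GlobalId) {
        // Deduplicate imports; mutating `self` is unproblematic because
        // the builder does not borrow from the catalog.
        if !self.imported_indexes.contains(&id) {
            self.imported_indexes.push(id);
        }
    }
}

fn main() {
    let mut catalog = Catalog { indexes: BTreeMap::new() };
    catalog.indexes.insert(GlobalId(1), "idx_a".to_string());
    let mut builder = DataflowBuilder::new(&catalog);
    builder.import_index(GlobalId(2));
    println!("{:?}", builder.imported_indexes);
}
```

The trade-off is a small amount of copying up front in exchange for methods that never wrestle with overlapping borrows.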
-
Questions for the optimizer/dataflow end

The following are questions that I think we need to decide on before we can proceed with a design for the dataflow builder. Each question comes with my opinion of what the answer should be.

Question 1: What is the exact set of available indexes that the optimizer wants the coordinator to provide for a given query?
I consider it acceptable for an initial version of the refactored DataflowBuilder to provide all indexes that fulfill condition 2 and to worry about narrowing the set down to ones that fulfill the other conditions later.

Question 2: What do the optimizer and dataflow layers consider the coordinator's responsibilities to be?
Right now, the optimizer does not tell the coordinator which indexes should be imported, but it should. When the optimizer is capable of telling the coordinator which indexes should be imported, I think:
Question 3: How should the coordinator pass the list of available indexes to the optimizer?

Thus, the coordinator should skip passing the list of available indexes to the logical optimizer but pass the available indexes to the MIR physical optimizer […]
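The proposed split could look roughly like the following sketch (function and type names are hypothetical, not the real optimizer API): the logical optimizer runs without index information, and only the MIR physical optimizer receives the set of available indexes.

```rust
// A toy stand-in for an MIR plan.
#[derive(Clone, Debug, PartialEq)]
struct MirPlan(String);

// Index-agnostic rewrites (predicate pushdown, etc.) would go here;
// note this pass receives no index information at all.
fn logical_optimize(plan: MirPlan) -> MirPlan {
    plan
}

// Index-aware decisions (join implementation, arrangement reuse) go
// here; because this pass sees the available indexes, it is also the
// place that can report which indexes were actually used.
fn physical_optimize(plan: MirPlan, available_indexes: &[u64]) -> MirPlan {
    if available_indexes.is_empty() {
        plan
    } else {
        MirPlan(format!("{} using {} indexes", plan.0, available_indexes.len()))
    }
}

fn main() {
    let plan = MirPlan("select".to_string());
    // The coordinator skips indexes for the logical pass...
    let plan = logical_optimize(plan);
    // ...and supplies them only to the physical pass.
    let plan = physical_optimize(plan, &[1, 2]);
    println!("{}", plan.0);
}
```

Keeping the logical pass index-free means its output is reusable regardless of which indexes exist at execution time, which is one argument for the split described above.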
-
This serves as a place to collect known issues, comments, and discussion items involving the dataflow builder, so that we can have a cross coordinator+dataflow discussion on how to fix the dataflow builder to deal with all of the issues.
Coordinator side issue: #8318
Unnecessary arrangement import issue: #4887
cc: @mjibson, @asenac, @frankmcsherry