-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Substrait's VirtualTables #10531
Conversation
@@ -1277,6 +1407,56 @@ pub(crate) fn from_substrait_literal(lit: &Literal) -> Result<ScalarValue> { | |||
s, | |||
) | |||
} | |||
Some(LiteralType::List(l)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't necessarily need these for this PR, so if it's preferrable I'm happy to split them into separate PRs - I just added them while coding to confirm that lists and structs inside lists are handled correctly.
Same applies for the ones in producer.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is convenient, using separate PRs would make the review process easier. Perhaps one PR could focus on the virtual table, while the other two could address the list and struct separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let ctx = create_context().await?; | ||
let row1 = vec![lit(1), lit("a")]; | ||
let row2 = vec![lit(2), lit("b")]; | ||
let plan = LogicalPlanBuilder::values(vec![row1, row2])?.build()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find VALUES
in the DataFusion SQL - is it not implemented yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is implemented, and you can use it like this.
roundtrip("VALUES (1, 'a', [[1,2,3], []], STRUCT(true, 1))").await
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'm blind, thanks!
I tried switching to that, but it produces somewhat ugly plans as it uses make_array
and struct
as UDFs, making the assert fail:
left: "Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]) AS make_array(make_array(Float64(-213.1),NULL,Float64(5.5),Float64(2),Float64(1)),make_array()), Struct({c0:true,c1:1}) AS struct(Boolean(true),Int64(1))), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:}))"
right: "Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]), Struct({c0:true,c1:1})), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:}))"
The AS
aliases it produces seem to get ignored (the column names are still column1, column2, column3, ..), so it doesn't seem to make sense to map them into Substrait (and I'm not even sure if that'd be possible). Any ideas how to go around that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. For Values plan, these aliases should be able to be ignored.
I think we could use assert_expected_plan
during testing to ignore aliases, for example:
#[tokio::test]
async fn roundtrip_values() -> Result<()> {
assert_expected_plan(
"VALUES (1, 'a', [[1,2,3], []], STRUCT(true, 1))",
"Values: (Int64(1), Utf8(\"a\"), List([[1, 2, 3], []]), Struct({c0:true,c1:1}))",
)
.await
}
Additionally, aliases should be ignored when handling LogicalPlan::Values
in to_substrait_rel
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, that works - 18b7f08.
Thanks @Blizzara -- I started the CI on this PR |
Some(LiteralType::Null(ntype)) => from_substrait_null(ntype)?, | ||
Some(LiteralType::List(l)) => { | ||
let element_dt: Option<&DataType> = match dt { | ||
Some(DataType::List(l)) => Ok(Some(l.data_type())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if dt
is redundant here and can be removed. We still need to check whether it is a LargeList
, just as we check type_variation_reference
below.
Is it possible to only extract element_dt
from elements[0]
? I guess that elements
are not empty because substrait has a special type called EmptyList.
pub struct List {
/// A homogeneously-typed list of one or more expressions that form the
/// list entries. To specify an empty list, use Literal.empty_list
/// (otherwise type information would be missing).
#[prost(message, repeated, tag="1")]
pub values: ::prost::alloc::vec::Vec<super::super::Expression>,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we already do that (look at the first element) if dt
is not given. I think the reason I included dt
was because Substrait types don't include the field names for structs, so not using dt
would mean using something like names
and name_idx
instead (like in from_substrait_type_with_names).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your explanation @Blizzara . My concern is that using two types for List simultaneously is a bit verbose unless it is necessary.
I tested the following two cases: the first one panicked, and the second one also failed.
#[tokio::test]
async fn roundtrip_list_iteral() -> Result<()> {
roundtrip("SELECT [] from data").await
}
#[tokio::test]
async fn roundtrip_struct_iteral() -> Result<()> {
roundtrip("select struct(1 as name0, 3.14 as name1, 'e', true as name3) from data")
.await
}
---- cases::roundtrip_logical_plan::roundtrip_list_iteral stdout ----
thread 'cases::roundtrip_logical_plan::roundtrip_list_iteral' panicked at consumer.rs:1425:41:
index out of bounds: the len is 0 but the index is 0
I think lists and structs are more complex than the virtual table, so it's better for them to have separate PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was because I wasn't handling empty lists properly (producer created Lists rather than EmptyList) - that was fixed in the Lists PR. Thanks for catching!
.expect("expected schema to exist for virtual table"); | ||
|
||
let fields = from_substrait_named_struct(&base_schema)?; | ||
let schema = DFSchemaRef::new(DFSchema::try_from(Schema::new(fields))?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that the schema here may not be necessary, perhaps it can be restored through LogicalPlanBuilder::values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably could, except for the column names - LogicalPlanBuilder::values names everything to just column1
etc. Do you see some benefit to using that (and then handling the column names somehow) instead of passing in the schema explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I once wanted to remove the schema, but we need to use it to pass the struct's names, so the current approach should be fine.
eb98045
to
7ec7d42
Compare
@jonahgao Thanks for the review - I pushed a new version but it builds on top of #10622 and #10640 so will need those merged and this rebased first before this PRs looks clean. This is mostly what it used to be with one more major change: I removed passing the datatype into from_substrait_literal and instead do the same thing with names as I did for from_substrait_type. That maybe seems a bit cleaner (even though I dislike the name handling, but I don't see a better way to deal with it based on how it works in Substrait :/) |
01ca26f
to
908e533
Compare
Adds support for Substrait's VirtualTables, ie. tables with data baked-in into the Substrait plan instead of being read from a source. Adds conversion in both ways (Substrait -> DataFusion and DataFusion -> Substrait) and a roundtrip test.
908e533
to
02e7bc0
Compare
} | ||
Ok(names) | ||
} | ||
DataType::List(l) => names_dfs(l.data_type()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to consider DataType::LargeList
here because to_substrait_type
supports it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed 9c83ba4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to create a LargeList using VALUES for testing, without writing a billion values in there? 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow_cast
is a solution .
VALUES(arrow_cast([1,2], 'LargeList(Int64)'));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I added that in 4b8babe, though it seems that arrow_cast doesn't yet support structs, so the test still wouldn't have caught this. Unless you can think of some workaround? I'm not familiar at all with arrow_cast and the like 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so too, let's keep it as a TODO.
|
||
#[tokio::test] | ||
async fn roundtrip_empty_relation() -> Result<()> { | ||
roundtrip("SELECT * FROM (VALUES (1)) LIMIT 0").await |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've slightly modified this test and printed out the schema on the consumer side, but it seems to be incorrect.
#[tokio::test]
async fn roundtrip_empty_relation() -> Result<()> {
roundtrip("SELECT * FROM (VALUES ([STRUCT(1 as a)], 2)) LIMIT 0").await
}
plan2 schema: DFSchema { inner: Schema { fields: [
Field { name: "column1", data_type: List(Field { name: "item", data_type: Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
# Its name should be "column2"
Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} },
field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need a test to verify the schemas. For this test, the plan str on both sides would just be EmptyRelation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Fixed in 24d07c9 and added schema verification to most tests
Also adds roundtrip schema assertions for cases where possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me👍. Thank you for your contribution @Blizzara .
* Add support for Substrait VirtualTables Adds support for Substrait's VirtualTables, ie. tables with data baked-in into the Substrait plan instead of being read from a source. Adds conversion in both ways (Substrait -> DataFusion and DataFusion -> Substrait) and a roundtrip test. * fix clippy * Add support for empty relations * Fix consuming Structs inside Lists and Structs Also adds roundtrip schema assertions for cases where possible * Rename from_substrait_struct -> from_substrait_struct_type for clarity * Add DataType::LargeList to to_substrait_named_struct * cargo fmt --all * Add validation that names list matches schema exactly * Add a LargeList into VALUES test
Which issue does this PR close?
Closes #10530
This (well, now together with the earlier PRs for chunks of this) is my first (but hopefully not last!) contribution to DataFusion, and I'm pretty new with Rust as well - so all feedback on both style and details is very much welcome!
Rationale for this change
What changes are included in this PR?
Adds support for Substrait's VirtualTables, ie. tables with data baked-in into the Substrait plan instead of being read from a source.
Adds conversion in both ways (Substrait -> DataFusion and DataFusion -> Substrait)
Adds support for both empty VirtualTables (mapped into LogicalPlan::EmptyTable) and those containing values (mapped into LogicalPlan::Values).
Are these changes tested?
Tested through round-trip tests, similar to other Substrait support.
Are there any user-facing changes?