-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add visitor for converting kernel expressions to engine expressions #363
base: main
Are you sure you want to change the base?
Add visitor for converting kernel expressions to engine expressions #363
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #363 +/- ##
==========================================
- Coverage 77.68% 75.66% -2.03%
==========================================
Files 49 49
Lines 10084 10354 +270
Branches 10084 10354 +270
==========================================
Hits 7834 7834
- Misses 1805 2075 +270
Partials 445 445 ☔ View full report in Codecov by Sentry. |
FFI code for schema and expressions have been moved to their own modules. Add prototype of the visitor to enable kernel expresison -> engine expression
1814c84
to
57fdf77
Compare
@@ -0,0 +1,78 @@ | |||
And |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicklan @zachschuermann How should I go about testing this? I imagine ffi/tests/something
is a sensible place to put this file. I need some way to pass this file to the executable, but read_table
expects a table input, not this.
Currently I havetest_kernel_expr
pointing directly to ./expression_test_results.txt
, but that makes it so that you can only run the executable from the read-table
directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is test_kernel_expr
? maybe a c
file you forgot to add?
I think you could have two binaries compiled in the read-table
example with adding something like:
add_executable(test_expr test_expr.c)
target_compile_definitions(test_expr PUBLIC DEFINE_DEFAULT_ENGINE)
target_include_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/ffi-headers")
target_link_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/debug")
target_link_libraries(test_expr PUBLIC delta_kernel_ffi)
target_compile_options(test_expr PUBLIC)
in the CMakeLists.txt
.
That's not ideal, even better would be a new print_expr
example. There's not much (if anything) you rely on from read_table.h
, so you could just copy over whatever you need into a new header. lmk if you need help with getting that set up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that I removed the expression test in the C code for now since that breaks builds. It currently Works On My Machine ™, but I'd like to make it part of CI if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just saw your msg. I'll get a second binary setup 👍
/// The caller is responsible for freeing the retured memory, either by calling | ||
/// [`free_kernel_predicate`], or [`Handle::drop_handle`] | ||
#[no_mangle] | ||
pub unsafe extern "C" fn get_kernel_expression() -> Handle<SharedExpression> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I currently have this C function that returns a big expression for testing. How can I limit visibility while still passing this over to something like read_table
?
ffi/src/expressions.rs
Outdated
/// Visit a `struct` which is constructed from an ordered list of expressions. This declares | ||
/// the number of expressions that the struct will be made of. The visitor will populate the | ||
/// list of expressions using the [`visit_struct_sub_expr`] method. | ||
pub visit_struct: extern "C" fn(data: *mut c_void, len: usize) -> usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is visiting a struct made up of a Vec<Expression>
. How's the naming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd call it visit_struct_expr
to make it clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing some comments
/// The caller is responsible for freeing the retured memory, either by calling | ||
/// [`free_kernel_predicate`], or [`Handle::drop_handle`] | ||
#[no_mangle] | ||
pub unsafe extern "C" fn get_kernel_expression() -> Handle<SharedExpression> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really just testing code right? At the very least I'd name it get_testing_kernel_expr
and ideally we'd hide it behind a #[cfg(test)]
but I'm not sure if we can easily have it then make it into the library. Maybe we make a new module that requires a feature and exposes these kinds of "test" functions?
ffi/src/expressions.rs
Outdated
} | ||
fn visit_expression(visitor: &mut EngineExpressionVisitor, expression: &Expression) -> usize { | ||
match expression { | ||
Expression::Literal(scalar) => visit_scalar(visitor, scalar), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're just blasting through the Literal
expression here without giving the engine any indication that that's what the visited scalar is associated with. If we ever want to use scalars embedded in some other expression that could be an issue.
I think we should probably add a visit_literal(scalar_id: usize)
. If engines want to to just "ignore" that they can just return scalar_id;
. Alternately we could rename all the visit_int
type expressions to visit_int_literal
to make it more clear. That's maybe easier for the ffi right now, but would be less flexible in the future. thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I can add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which? i'm not totally clear which is best :)
/// to visitor methods | ||
/// TODO: Add type information in struct field and null. This will likely involve using the schema visitor. | ||
#[repr(C)] | ||
pub struct EngineExpressionVisitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uhh, wow, that's a lot of visitors :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yessir! Lots of hours and thought, apologies for the big PR lol
ffi/examples/read-table/read_table.c
Outdated
@@ -3,6 +3,7 @@ | |||
#include <string.h> | |||
|
|||
#include "arrow.h" | |||
#include "expression.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think you need this
@@ -0,0 +1,78 @@ | |||
And |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is test_kernel_expr
? maybe a c
file you forgot to add?
I think you could have two binaries compiled in the read-table
example with adding something like:
add_executable(test_expr test_expr.c)
target_compile_definitions(test_expr PUBLIC DEFINE_DEFAULT_ENGINE)
target_include_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/ffi-headers")
target_link_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/debug")
target_link_libraries(test_expr PUBLIC delta_kernel_ffi)
target_compile_options(test_expr PUBLIC)
in the CMakeLists.txt
.
That's not ideal, even better would be a new print_expr
example. There's not much (if anything) you rely on from read_table.h
, so you could just copy over whatever you need into a new header. lmk if you need help with getting that set up
float float_data; | ||
double double_data; | ||
bool boolean_data; | ||
struct KernelStringSlice string_data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this to be a KernelStringSlice
, just make it a char*
. Get rid of copy_kernel_string()
and just allocate a c-string instead, like:
// utility to turn a slice into a char*
char* allocate_string(const KernelStringSlice slice)
{
return strndup(slice.ptr, slice.len);
}
ffi/examples/read-table/expression.h
Outdated
} ExpressionBuilder; | ||
struct Struct | ||
{ | ||
KernelStringSlice* field_names; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below, just make this a char**
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thanks, lots of work!
One major comment, I think it might be better to model this the same we do schemas where each visit_X
call gets a sibling_list_id
which tells it which list of expressions it should append itself to. Then you don't need all the visit_struct_sub_expr
methods, you can just do something like:
fn visit_struct(visitor: &mut EngineExpressionVisitor, exprs: &Vec<Expression>) -> usize {
let child_list_id = visitor.make_expr_list(exprs.len());
for expr in exprs {
visit_expression(visitor, expr, child_list_id);
}
visit_struct(visitor.data, child_list_id)
}
There is a subtle difference that visit_schema
returns a list id at the end because we know a schema is always a list of struct fields at the top. Here we could be visiting just one expression, but we could just define that the top level always returns a single element list, or figure out how to return the top level expr id directly.
I realize that's a somewhat big change so open to pushback :)
@@ -0,0 +1,751 @@ | |||
#include "assert.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include "assert.h" | |
#include <assert.h> |
Expr::binary( | ||
BinaryOperator::In, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::Plus, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::Minus, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::Equal, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::NotEqual, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::NotIn, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::Divide, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::Multiply, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::LessThan, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::LessThanOrEqual, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::GreaterThan, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::GreaterThanOrEqual, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), | ||
Expr::binary( | ||
BinaryOperator::Distinct, | ||
Expr::literal(Scalar::Integer(0)), | ||
Expr::literal(Scalar::Long(0)), | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this to the end and then do something like:
vec.extend(
[
BinaryOperator::In,
BinaryOperator::Plus,
BinaryOperator::Minus,
BinaryOperator::Equal,
BinaryOperator::NotEqual,
BinaryOperator::NotIn,
BinaryOperator::Divide,
BinaryOperator::Multiply,
BinaryOperator::LessThan,
BinaryOperator::LessThanOrEqual,
BinaryOperator::GreaterThan,
BinaryOperator::GreaterThanOrEqual,
BinaryOperator::Distinct,
]
.iter()
.map(|op| {
Expr::binary(
op,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
)
}),
)
ffi/src/expressions.rs
Outdated
/// Visit a `struct` which is constructed from an ordered list of expressions. This declares | ||
/// the number of expressions that the struct will be made of. The visitor will populate the | ||
/// list of expressions using the [`visit_struct_sub_expr`] method. | ||
pub visit_struct: extern "C" fn(data: *mut c_void, len: usize) -> usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd call it visit_struct_expr
to make it clear
ffi/src/expressions.rs
Outdated
($visitor.$visitor_fn)($visitor.data $(, $extra_args) *) | ||
}; | ||
} | ||
fn visit_array(visitor: &mut EngineExpressionVisitor, array: &ArrayData) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are functions defined in an unsafe
function themselves unsafe
, or are they safe? :)
}; | ||
} | ||
fn visit_array(visitor: &mut EngineExpressionVisitor, array: &ArrayData) -> usize { | ||
#[allow(deprecated)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
ffi/src/expressions.rs
Outdated
} | ||
fn visit_expression(visitor: &mut EngineExpressionVisitor, expression: &Expression) -> usize { | ||
match expression { | ||
Expression::Literal(scalar) => visit_scalar(visitor, scalar), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which? i'm not totally clear which is best :)
ffi/examples/read-table/expression.h
Outdated
typedef struct | ||
{ | ||
size_t len; | ||
ExpressionRef handles[100]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's prefer not to call these handles
, which has a very specific meaning from the rust handle.rs
stuff. expr_refs
maybe? Change names below as well.
Also probably best to just make it an ExpressionRef *
and realloc
it if you need it to grow. Otherwise you have a lurking bug below where >100 expressions will be bad news.
ffi/examples/read-table/expression.h
Outdated
uintptr_t schema_list_id = visit_expression(&predicate, &visitor); | ||
return data.handles[schema_list_id]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uintptr_t schema_list_id = visit_expression(&predicate, &visitor); | |
return data.handles[schema_list_id]; | |
uintptr_t expr_id = visit_expression(&predicate, &visitor); | |
return data.handles[expr_id]; |
Yeah it didn't make sense to me initially to have a one-size top level/one-size array for unary. But this cuts down on functions for visiting complex types, and fewer functions >> avoiding these allocations. I went ahead and implemented sibling_id based. Seems to pass the test, but I need to clean up a bit. This could also fit nicely with the schema code for visiting structs and nulls with |
Closes: #358
Depends on: #360