Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add visitor for converting kernel expressions to engine expressions #363

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

OussamaSaoudi-db
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db commented Sep 27, 2024

Closes: #358
Depends on: #360

Copy link

codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 0.36900% with 270 lines in your changes missing coverage. Please review.

Project coverage is 75.66%. Comparing base (6b0df5d) to head (c8f80a3).

Files with missing lines Patch % Lines
ffi/src/expressions.rs 0.00% 270 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #363      +/-   ##
==========================================
- Coverage   77.68%   75.66%   -2.03%     
==========================================
  Files          49       49              
  Lines       10084    10354     +270     
  Branches    10084    10354     +270     
==========================================
  Hits         7834     7834              
- Misses       1805     2075     +270     
  Partials      445      445              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ffi/src/expressions.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Oct 9, 2024
@OussamaSaoudi-db OussamaSaoudi-db changed the title [WIP] Add visitor for converting kernel expressions to engine expressions Add visitor for converting kernel expressions to engine expressions Oct 10, 2024
@@ -0,0 +1,78 @@
And
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan @zachschuermann How should I go about testing this? I imagine ffi/tests/something is a sensible place to put this file. I need some way to pass this file to the executable, but read_table expects a table input, not this.

Currently I havetest_kernel_expr pointing directly to ./expression_test_results.txt, but that makes it so that you can only run the executable from the read-table directory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is test_kernel_expr? maybe a c file you forgot to add?

I think you could have two binaries compiled in the read-table example with adding something like:

add_executable(test_expr test_expr.c)
target_compile_definitions(test_expr PUBLIC DEFINE_DEFAULT_ENGINE)
target_include_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/ffi-headers")
target_link_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/debug")
target_link_libraries(test_expr PUBLIC delta_kernel_ffi)
target_compile_options(test_expr PUBLIC)

in the CMakeLists.txt.

That's not ideal, even better would be a new print_expr example. There's not much (if anything) you rely on from read_table.h, so you could just copy over whatever you need into a new header. lmk if you need help with getting that set up

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that I removed the expression test in the C code for now since that breaks builds. It currently Works On My Machine ™, but I'd like to make it part of CI if possible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw your msg. I'll get a second binary setup 👍

@OussamaSaoudi-db OussamaSaoudi-db marked this pull request as ready for review October 10, 2024 23:10
/// The caller is responsible for freeing the retured memory, either by calling
/// [`free_kernel_predicate`], or [`Handle::drop_handle`]
#[no_mangle]
pub unsafe extern "C" fn get_kernel_expression() -> Handle<SharedExpression> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently have this C function that returns a big expression for testing. How can I limit visibility while still passing this over to something like read_table?

/// Visit a `struct` which is constructed from an ordered list of expressions. This declares
/// the number of expressions that the struct will be made of. The visitor will populate the
/// list of expressions using the [`visit_struct_sub_expr`] method.
pub visit_struct: extern "C" fn(data: *mut c_void, len: usize) -> usize,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is visiting a struct made up of a Vec<Expression>. How's the naming?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd call it visit_struct_expr to make it clear

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing some comments

/// The caller is responsible for freeing the retured memory, either by calling
/// [`free_kernel_predicate`], or [`Handle::drop_handle`]
#[no_mangle]
pub unsafe extern "C" fn get_kernel_expression() -> Handle<SharedExpression> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really just testing code right? At the very least I'd name it get_testing_kernel_expr and ideally we'd hide it behind a #[cfg(test)] but I'm not sure if we can easily have it then make it into the library. Maybe we make a new module that requires a feature and exposes these kinds of "test" functions?

}
fn visit_expression(visitor: &mut EngineExpressionVisitor, expression: &Expression) -> usize {
match expression {
Expression::Literal(scalar) => visit_scalar(visitor, scalar),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're just blasting through the Literal expression here without giving the engine any indication that that's what the visited scalar is associated with. If we ever want to use scalars embedded in some other expression that could be an issue.

I think we should probably add a visit_literal(scalar_id: usize). If engines want to to just "ignore" that they can just return scalar_id;. Alternately we could rename all the visit_int type expressions to visit_int_literal to make it more clear. That's maybe easier for the ffi right now, but would be less flexible in the future. thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I can add that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which? i'm not totally clear which is best :)

/// to visitor methods
/// TODO: Add type information in struct field and null. This will likely involve using the schema visitor.
#[repr(C)]
pub struct EngineExpressionVisitor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhh, wow, that's a lot of visitors :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yessir! Lots of hours and thought, apologies for the big PR lol

@@ -3,6 +3,7 @@
#include <string.h>

#include "arrow.h"
#include "expression.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think you need this

@@ -0,0 +1,78 @@
And
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is test_kernel_expr? maybe a c file you forgot to add?

I think you could have two binaries compiled in the read-table example with adding something like:

add_executable(test_expr test_expr.c)
target_compile_definitions(test_expr PUBLIC DEFINE_DEFAULT_ENGINE)
target_include_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/ffi-headers")
target_link_directories(test_expr PUBLIC "${CMAKE_CURRENT_SOURCE_DIR}/../../../target/debug")
target_link_libraries(test_expr PUBLIC delta_kernel_ffi)
target_compile_options(test_expr PUBLIC)

in the CMakeLists.txt.

That's not ideal, even better would be a new print_expr example. There's not much (if anything) you rely on from read_table.h, so you could just copy over whatever you need into a new header. lmk if you need help with getting that set up

float float_data;
double double_data;
bool boolean_data;
struct KernelStringSlice string_data;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this to be a KernelStringSlice, just make it a char*. Get rid of copy_kernel_string() and just allocate a c-string instead, like:

// utility to turn a slice into a char*
char* allocate_string(const KernelStringSlice slice)
{
  return strndup(slice.ptr, slice.len);
}

} ExpressionBuilder;
struct Struct
{
KernelStringSlice* field_names;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below, just make this a char**.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks, lots of work!

One major comment, I think it might be better to model this the same we do schemas where each visit_X call gets a sibling_list_id which tells it which list of expressions it should append itself to. Then you don't need all the visit_struct_sub_expr methods, you can just do something like:

fn visit_struct(visitor: &mut EngineExpressionVisitor, exprs: &Vec<Expression>) -> usize {
  let child_list_id = visitor.make_expr_list(exprs.len());
  for expr in exprs {
    visit_expression(visitor, expr, child_list_id);
  }
  visit_struct(visitor.data, child_list_id)
}

There is a subtle difference that visit_schema returns a list id at the end because we know a schema is always a list of struct fields at the top. Here we could be visiting just one expression, but we could just define that the top level always returns a single element list, or figure out how to return the top level expr id directly.

I realize that's a somewhat big change so open to pushback :)

@@ -0,0 +1,751 @@
#include "assert.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "assert.h"
#include <assert.h>

Comment on lines +318 to +382
Expr::binary(
BinaryOperator::In,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::Plus,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::Minus,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::Equal,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::NotEqual,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::NotIn,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::Divide,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::Multiply,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::LessThan,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::LessThanOrEqual,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::GreaterThan,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::GreaterThanOrEqual,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Expr::binary(
BinaryOperator::Distinct,
Expr::literal(Scalar::Integer(0)),
Expr::literal(Scalar::Long(0)),
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to the end and then do something like:

    vec.extend(
        [
            BinaryOperator::In,
            BinaryOperator::Plus,
            BinaryOperator::Minus,
            BinaryOperator::Equal,
            BinaryOperator::NotEqual,
            BinaryOperator::NotIn,
            BinaryOperator::Divide,
            BinaryOperator::Multiply,
            BinaryOperator::LessThan,
            BinaryOperator::LessThanOrEqual,
            BinaryOperator::GreaterThan,
            BinaryOperator::GreaterThanOrEqual,
            BinaryOperator::Distinct,
        ]
        .iter()
        .map(|op| {
            Expr::binary(
                op,
                Expr::literal(Scalar::Integer(0)),
                Expr::literal(Scalar::Long(0)),
            )
        }),
    )

/// Visit a `struct` which is constructed from an ordered list of expressions. This declares
/// the number of expressions that the struct will be made of. The visitor will populate the
/// list of expressions using the [`visit_struct_sub_expr`] method.
pub visit_struct: extern "C" fn(data: *mut c_void, len: usize) -> usize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd call it visit_struct_expr to make it clear

($visitor.$visitor_fn)($visitor.data $(, $extra_args) *)
};
}
fn visit_array(visitor: &mut EngineExpressionVisitor, array: &ArrayData) -> usize {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are functions defined in an unsafe function themselves unsafe, or are they safe? :)

};
}
fn visit_array(visitor: &mut EngineExpressionVisitor, array: &ArrayData) -> usize {
#[allow(deprecated)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

ffi/src/expressions.rs Outdated Show resolved Hide resolved
}
fn visit_expression(visitor: &mut EngineExpressionVisitor, expression: &Expression) -> usize {
match expression {
Expression::Literal(scalar) => visit_scalar(visitor, scalar),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which? i'm not totally clear which is best :)

typedef struct
{
size_t len;
ExpressionRef handles[100];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's prefer not to call these handles, which has a very specific meaning from the rust handle.rs stuff. expr_refs maybe? Change names below as well.

Also probably best to just make it an ExpressionRef * and realloc it if you need it to grow. Otherwise you have a lurking bug below where >100 expressions will be bad news.

Comment on lines 422 to 423
uintptr_t schema_list_id = visit_expression(&predicate, &visitor);
return data.handles[schema_list_id];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uintptr_t schema_list_id = visit_expression(&predicate, &visitor);
return data.handles[schema_list_id];
uintptr_t expr_id = visit_expression(&predicate, &visitor);
return data.handles[expr_id];

@OussamaSaoudi-db
Copy link
Collaborator Author

Nice thanks, lots of work!

One major comment, I think it might be better to model this the same we do schemas where each visit_X call gets a sibling_list_id which tells it which list of expressions it should append itself to. Then you don't need all the visit_struct_sub_expr methods, you can just do something like:

fn visit_struct(visitor: &mut EngineExpressionVisitor, exprs: &Vec<Expression>) -> usize {
  let child_list_id = visitor.make_expr_list(exprs.len());
  for expr in exprs {
    visit_expression(visitor, expr, child_list_id);
  }
  visit_struct(visitor.data, child_list_id)
}

There is a subtle difference that visit_schema returns a list id at the end because we know a schema is always a list of struct fields at the top. Here we could be visiting just one expression, but we could just define that the top level always returns a single element list, or figure out how to return the top level expr id directly.

I realize that's a somewhat big change so open to pushback :)

Yeah it didn't make sense to me initially to have a one-size top level/one-size array for unary. But this cuts down on functions for visiting complex types, and fewer functions >> avoiding these allocations. I went ahead and implemented sibling_id based. Seems to pass the test, but I need to clean up a bit.

This could also fit nicely with the schema code for visiting structs and nulls with DataType.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Development

Successfully merging this pull request may close these issues.

Add visitor for converting kernel expressions to engine expressions
3 participants