Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add remaining data types to schema visiting in ffi #187

Merged
merged 29 commits into from
May 10, 2024
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
1d5a9dd
Add most types other than a few less common primitive ones
nicklan May 1, 2024
32428a5
add remaining types
nicklan May 1, 2024
7f59e54
include pragma once
nicklan May 2, 2024
bcc5a4f
include schema printing
nicklan May 2, 2024
9593038
docs mostly
nicklan May 2, 2024
6fd15cf
fmt
nicklan May 2, 2024
4baa68a
free schema!
nicklan May 2, 2024
8747fe3
cleanup
nicklan May 2, 2024
6c4a173
fix pointer issue
nicklan May 6, 2024
87ef58c
strndup + macro
nicklan May 6, 2024
21a3402
Apply suggestions from code review
nicklan May 6, 2024
78378b8
Merge branch 'main' into add-most-schema-types-to-ffi
nicklan May 7, 2024
0fc9bd7
don't forget long
nicklan May 7, 2024
a3ff414
untabify :sigh:
nicklan May 7, 2024
03c34db
Merge branch 'main' into add-most-schema-types-to-ffi
nicklan May 7, 2024
184e2c3
fix up selection vector stuff
nicklan May 7, 2024
0805f38
maps can have nulls + print if array/map has null in schema.h
nicklan May 7, 2024
76b942b
cleaner visit macro
nicklan May 8, 2024
245da8a
comment update
nicklan May 8, 2024
d479631
clearer todo
nicklan May 8, 2024
872b0e7
fix schema printing for deep nesting
nicklan May 8, 2024
47abc39
use schema appropriate names
nicklan May 8, 2024
84af4ba
Merge branch 'main' into add-most-schema-types-to-ffi
nicklan May 8, 2024
cd86766
Merge branch 'main' into add-most-schema-types-to-ffi
nicklan May 10, 2024
93fc17c
macros are fun
nicklan May 10, 2024
711b16b
Apply suggestions from code review
nicklan May 10, 2024
953c3e8
doc update
nicklan May 10, 2024
2498d10
slightly more consistent formatting
nicklan May 10, 2024
8cdf7f8
more consistent array indexing
nicklan May 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
154 changes: 61 additions & 93 deletions ffi/examples/read-table/schema.h
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
#include <stdint.h>
#include "delta_kernel_ffi.h"

/**
Expand All @@ -8,21 +9,21 @@
* Each list is a "SchemaItemList", which tracks its length an an array of "SchemaItem"s.
*
* Each "SchemaItem" has a name and a type, which are just strings. It can also have a list which is
* its children. This is initially always NULL, but when visiting a struct, map, or array, we point
* this at the list specified in the callback, which allows us to traverse the schema when printing
* it. Note that this points to one of the lists in the builder's set of lists and is not a copy.
* its children. This is initially always UINTPTR_MAX, but when visiting a struct, map, or array, we
* point this at the list id specified in the callback, which allows us to traverse the schema when
* printing it.
*/

// If you want the visitor to print out what it's being asked to do at each step, uncomment the
// following line
// #define PRINT_VISITS
//#define PRINT_VISITS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could also potentially leverage variadic macro args for this:

#ifdef VERBOSE
#define PRINT_VISITS(...) printf(__VA_ARGS__)
#else
#define PRINT_VISIT(...)
#endif

and then just use it:

PRINT_VISIT("Making a list of length %i with id %i\n", reserve, id);

(a bit more work up front, but less #ifdef magic at use sites)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also more fun :) switched to that


typedef struct SchemaItemList SchemaItemList;

typedef struct {
char* name;
char* type;
SchemaItemList* children;
uintptr_t children;
} SchemaItem;

typedef struct SchemaItemList {
Expand All @@ -36,9 +37,7 @@ typedef struct {
} SchemaBuilder;

char* allocate_name(const KernelStringSlice slice) {
char* buf = malloc(sizeof(char) * (slice.len + 1)); // +1 for null
snprintf(buf, slice.len + 1, "%s", slice.ptr);
return buf;
return strndup(slice.ptr, slice.len);
}

// lists are preallocated to have exactly enough space, so we just fill in the next open slot and
Expand All @@ -52,42 +51,40 @@ SchemaItem* add_to_list(SchemaItemList *list, char* name, char* type) {
}

// print out all items in a list, recursing into any children they may have
void print_list(SchemaItemList *list, int indent, bool parent_on_last) {
void print_list(SchemaBuilder* builder, uintptr_t list_id, int indent, bool parent_on_last) {
SchemaItemList *list = builder->lists+list_id;
nicklan marked this conversation as resolved.
Show resolved Hide resolved
for (int i = 0; i < list->len; i++) {
bool is_last = i == list->len - 1;
for (int j = 0; j <= indent; j++) {
if (j == indent) {
if (is_last) {
printf("└");
} else {
printf("├");
}
for (int j = 0; j < indent; j++) {
if (parent_on_last && j == indent - 1) {
// don't print a dangling | on my parent's last item
printf(" ");
} else {
if (parent_on_last && j == indent - 1) {
// don't print a dangling | on my parent's last item
printf(" ");
} else {
printf("│ ");
}
printf("│ ");
}
}
printf("─ %s: %s\n", list->list[i].name, list->list[i].type);
if (list->list[i].children) {
print_list(list->list[i].children, indent+1, is_last);
SchemaItem* item = &list->list[i];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably decide whether to use &array[index] vs. array + index and be consistent?

(I personally prefer the former because it's obvious at a glance that it's not normal arithmetic... but I recognize it's also a bit more typing)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense. I tend to use array + index when I know I just want the pointer, and array[index] when I want to use the item there (like array[index].name), but consistency is good. I've updated to all &array[index] style for getting pointers.

char* prefix = is_last? "└" : "├";
printf("%s─ %s: %s\n", prefix, item->name, item->type);
if (list->list[i].children != UINTPTR_MAX) {
print_list(builder, list->list[i].children, indent+1, is_last);
}
}
}

// declare all our visitor methods
uintptr_t make_field_list(void *data, uintptr_t reserve) {
SchemaBuilder *builder = (SchemaBuilder*)data;
SchemaBuilder *builder = data;
int id = builder->list_count;
#ifdef PRINT_VISITS
printf("Making a list of lenth %i with id %i\n", reserve, id);
#endif
builder->list_count++;
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why realloc, out of curiosity? What previous usage could there have been?

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, this is being used to grow the array... but realloc often (usually?) moves the memory to a new location, which would invalidate all the child pointers. To avoid dangling pointers, SchemaItem::children needs to be an index into SchemaBuilder::lists. Doing that would also make it obvious in the free_builder function, why we don't need to free the child list when freeing its parent.

Copy link
Collaborator Author

@nicklan nicklan May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, real bad on the pointers 🤦🏽 .

I've updated things to use indexes.

Why realloc, out of curiosity? What previous usage could there have been?

We don't know the total number of lists ahead of time. SchemaBuilder.lists is an array of SchemaItemList, so this realloc is just adding one more to it (list_count gets incremented before this realloc call). Previous usage was when we made the builder and did

  SchemaBuilder builder = {
    .list_count = 0,
    .lists = calloc(0, sizeof(SchemaItem*)),
  };

as well as each other call to this method.

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu): "Unless ptr is NULL, it must have been returned by an earlier call to malloc or related functions. If the area pointed to was moved, a free(ptr) is done."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu)

I was referring to the lists of SchemaItem that each SchemaItemList in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One gotcha with using realloc this way -- it has O(n**2) cost, where n is the final list_count. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n) cost. Probably doesn't matter for this example code tho.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I considered increasing more each time, but:
a) it requires tracking an extra capacity param and
b) for example code as you say it's probably overkill

SchemaItem* list = calloc(reserve, sizeof(SchemaItem));
for (int i = 0; i < reserve; i++) {
list[i].children = UINTPTR_MAX;
}
builder->lists[id].len = 0;
builder->lists[id].list = list;
return id;
Expand All @@ -97,38 +94,38 @@ void visit_struct(void *data,
uintptr_t sibling_list_id,
struct KernelStringSlice name,
uintptr_t child_list_id) {
SchemaBuilder *builder = data;
char* name_ptr = allocate_name(name);
#ifdef PRINT_VISITS
printf("Asked to visit a struct, belonging to list %i. Children are in %i\n", sibling_list_id, child_list_id);
printf("Asked to visit a struct, belonging to list %i for %s. Children are in %i\n", sibling_list_id, name_ptr, child_list_id);
#endif
SchemaBuilder *builder = (SchemaBuilder*)data;
char* name_ptr = allocate_name(name);
SchemaItem* struct_item = add_to_list(builder->lists+sibling_list_id, name_ptr, "struct");
struct_item->children = builder->lists+child_list_id;
struct_item->children = child_list_id;
}
void visit_array(void *data,
uintptr_t sibling_list_id,
struct KernelStringSlice name,
bool contains_null,
uintptr_t child_list_id) {
SchemaBuilder *builder = data;
char* name_ptr = allocate_name(name);
#ifdef PRINT_VISITS
printf("Asked to visit array, belonging to list %i. Types are in %i\n", sibling_list_id, child_list_id);
printf("Asked to visit array, belonging to list %i for %s. Types are in %i\n", sibling_list_id, name_ptr, child_list_id);
#endif
SchemaBuilder *builder = (SchemaBuilder*)data;
char* name_ptr = allocate_name(name);
SchemaItem* array_item = add_to_list(builder->lists+sibling_list_id, name_ptr, "array");
array_item->children = builder->lists+child_list_id;
array_item->children = child_list_id;
}
void visit_map(void *data,
uintptr_t sibling_list_id,
struct KernelStringSlice name,
uintptr_t child_list_id) {
SchemaBuilder *builder = data;
char* name_ptr = allocate_name(name);
#ifdef PRINT_VISITS
printf("Asked to visit map, belonging to list %i. Types are in %i\n", sibling_list_id, child_list_id);
printf("Asked to visit map, belonging to list %i for %s. Types are in %i\n", sibling_list_id, name_ptr, child_list_id);
#endif
SchemaBuilder *builder = (SchemaBuilder*)data;
char* name_ptr = allocate_name(name);
SchemaItem* map_item = add_to_list(builder->lists+sibling_list_id, name_ptr, "map");
map_item->children = builder->lists+child_list_id;
map_item->children = child_list_id;
}

void visit_decimal(void *data,
Expand All @@ -139,70 +136,43 @@ void visit_decimal(void *data,
#ifdef PRINT_VISITS
printf("Asked to visit decimal with precision %i and scale %i, belonging to list %i\n", sibling_list_id);
#endif
SchemaBuilder *builder = (SchemaBuilder*)data;
SchemaBuilder *builder = data;
char* name_ptr = allocate_name(name);
char* type = malloc(16 * sizeof(char));
sprintf(type, "decimal(%i)(%i)", precision, scale);
snprintf(type, 16, "decimal(%i)(%i)", precision, scale);
add_to_list(builder->lists+sibling_list_id, name_ptr, type);
}



void visit_simple_type(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name, char* type) {
SchemaBuilder *builder = data;
char* name_ptr = allocate_name(name);
#ifdef PRINT_VISITS
printf("Asked to visit a(n) %s belonging to list %i\n", type, sibling_list_id);
printf("Asked to visit a(n) %s belonging to list %i for %s\n", type, sibling_list_id, name_ptr);
#endif
SchemaBuilder *builder = (SchemaBuilder*)data;
char* name_ptr = allocate_name(name);
add_to_list(builder->lists+sibling_list_id, name_ptr, type);
}

void visit_string(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "string");
}

void visit_long(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "long");
}

void visit_integer(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "integer");
}

void visit_short(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "short");
}

void visit_byte(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "byte");
}
void visit_float(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "float");
}

void visit_double(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "double");
}

void visit_boolean(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "boolean");
}

void visit_binary(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "binary");
}

void visit_date(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "date");
}

void visit_timestamp(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "timestamp");
}

void visit_timestamp_ntz(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) {
visit_simple_type(data, sibling_list_id, name, "timestamp_ntz");
}
#define DEFINE_VISIT_SIMPLE_TYPE(typename) \
void visit_##typename(void *data, uintptr_t sibling_list_id, struct KernelStringSlice name) { \
visit_simple_type(data, sibling_list_id, name, #typename); \
}

// free all the data in the builder (but not the builder itself, since that might be stack allocated)
DEFINE_VISIT_SIMPLE_TYPE(string);
DEFINE_VISIT_SIMPLE_TYPE(integer);
DEFINE_VISIT_SIMPLE_TYPE(short);
DEFINE_VISIT_SIMPLE_TYPE(byte);
DEFINE_VISIT_SIMPLE_TYPE(long);
DEFINE_VISIT_SIMPLE_TYPE(float);
DEFINE_VISIT_SIMPLE_TYPE(double);
DEFINE_VISIT_SIMPLE_TYPE(boolean);
DEFINE_VISIT_SIMPLE_TYPE(binary);
DEFINE_VISIT_SIMPLE_TYPE(date);
DEFINE_VISIT_SIMPLE_TYPE(timestamp);
DEFINE_VISIT_SIMPLE_TYPE(timestamp_ntz);

// free all the data in the builder (but not the builder itself, it's stack allocated)
void free_builder(SchemaBuilder builder) {
scovich marked this conversation as resolved.
Show resolved Hide resolved
for (int i = 0; i < builder.list_count; i++) {
SchemaItemList *list = (builder.lists)+i;
Expand All @@ -214,8 +184,6 @@ void free_builder(SchemaBuilder builder) {
// except decimal types, we malloc'd those :)
free(item->type);
}
// don't free item->children, it's just a list in the builder so will be freed by the outer
// loop.
}
free(list->list); // free all the items in this list (we alloc'd them together)
}
Expand Down Expand Up @@ -253,7 +221,7 @@ void print_schema(const SnapshotHandle *snapshot) {
printf("Schema returned in list %i\n", schema_list_id);
#endif
printf("Schema:\n");
print_list(builder.lists+schema_list_id, 0, false);
print_list(&builder, schema_list_id, 0, false);
printf("\n");
free_builder(builder);
}
8 changes: 4 additions & 4 deletions ffi/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -394,6 +394,7 @@ unsafe fn unwrap_and_parse_path_as_url(path: KernelStringSlice) -> DeltaResult<U
}

/// A builder that allows setting options on the `Engine` before actually building it
#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
pub struct EngineBuilder {
url: Url,
allocate_fn: AllocateErrorFn,
Expand Down Expand Up @@ -586,14 +587,13 @@ pub unsafe extern "C" fn version(snapshot: *const SnapshotHandle) -> u64 {
/// its schema it will do the following at each node:
///
/// - For a struct, map, or array type:
/// 1. Ask engine to allocate a list, specifying enough space to hold associated schema
/// information.
/// 1. Ask engine to allocate a "child" list, indicating how many items will be needed.
/// 2. For a struct, visit each child, with `sibling_list_id` set to the just allocated list.
/// 3. For a map, visit the key type and the value type with `sibling_list_id` set to the just
/// allocated list.
/// 4. For an array, visit the item type with `sibling_list_id` set to the just allocated list.
/// 5. Finally call `visit_[struct/array/map]` with `child_list_id` set to the list allocated
/// in `1`. This has enough information to build the entire type on the engine side.
/// in step 1. This has enough information to build the entire type on the engine side.
///
/// - For a simple type simply call `visit_[type]`, with `sibling_list_id` set to the `id` of the
/// list the type should be added to.
Expand All @@ -609,7 +609,7 @@ pub struct EngineSchemaVisitor {
pub make_field_list: extern "C" fn(data: *mut c_void, reserve: usize) -> usize,

// visitor methods that should instantiate and append the appropriate type to the field list
/// Indidate that the schema contains a `Struct` type. The top level of a Schema is always a
/// Indicate that the schema contains a `Struct` type. The top level of a Schema is always a
/// `Struct`. The children of the `Struct` are in the list identified by `child_list_id`.
nicklan marked this conversation as resolved.
Show resolved Hide resolved
pub visit_struct: extern "C" fn(
data: *mut c_void,
Expand Down
2 changes: 1 addition & 1 deletion ffi/src/scan.rs
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ pub struct ArrowFFIData {
///
/// # Safety
/// data_handle must be a valid EngineDataHandle as read by the
/// [`delta_kernel::engine::default::DefaultEngine`] obtained from `get_default_client`.
/// [`delta_kernel::engine::default::DefaultEngine`] obtained from `get_default_engine`.
#[cfg(feature = "default-engine")]
pub unsafe extern "C" fn get_raw_arrow_data(
data_handle: *mut EngineDataHandle,
Expand Down
Loading