Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add remaining data types to schema visiting in ffi #187

Merged
merged 29 commits into from
May 10, 2024

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented May 1, 2024

Adds more types, and uses a macro to reduce repetition for the primitive ones. Also adds some rust docs to try and explain how the visitors for schema work.

Tested by adding a visitor to the read_table program that builds and prints the schema.

Output:

.$ /read_table file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/nested_types/delta/
Reading table at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/nested_types/delta/
version: 0

Schema:
├─ pk: integer
├─ struct: struct
│  ├─ float64: double
│  └─ bool: boolean
├─ array: array
│  └─ array_data_type: short
└─ map: map
   ├─ map_key_type: string
   └─ map_value_type: integer

Got some data
  Of this data, here is a selection vector
    sel[0] = 0
    sel[1] = 0
    sel[2] = 0
    sel[3] = 1
called back to actually read!
  path: part-00000-3b8ca3f0-6444-4ed5-8961-c605cba95bf1-c000.snappy.parquet
  No selection vector for this call
  no partition here
Iterator done
All done

@nicklan nicklan changed the title Add most types other than a few less common primitive ones Add remaining data types May 1, 2024
@nicklan nicklan changed the title Add remaining data types Add remaining data types to schema visiting in ffi May 1, 2024
@nicklan nicklan force-pushed the add-most-schema-types-to-ffi branch 6 times, most recently from 86534d7 to 1aecefd Compare May 2, 2024 23:17
@nicklan nicklan requested review from scovich and samansmink May 2, 2024 23:18
@nicklan nicklan force-pushed the add-most-schema-types-to-ffi branch from 1aecefd to 2f475b4 Compare May 3, 2024 20:32
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass

ffi/examples/read-table/schema.h Outdated Show resolved Hide resolved
ffi/examples/read-table/schema.h Outdated Show resolved Hide resolved
ffi/examples/read-table/schema.h Outdated Show resolved Hide resolved
printf("Making a list of lenth %i with id %i\n", reserve, id);
#endif
builder->list_count++;
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why realloc, out of curiosity? What previous usage could there have been?

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, this is being used to grow the array... but realloc often (usually?) moves the memory to a new location, which would invalidate all the child pointers. To avoid dangling pointers, SchemaItem::children needs to be an index into SchemaBuilder::lists. Doing that would also make it obvious in the free_builder function, why we don't need to free the child list when freeing its parent.

Copy link
Collaborator Author

@nicklan nicklan May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, real bad on the pointers 🤦🏽 .

I've updated things to use indexes.

Why realloc, out of curiosity? What previous usage could there have been?

We don't know the total number of lists ahead of time. SchemaBuilder.lists is an array of SchemaItemList, so this realloc is just adding one more to it (list_count gets incremented before this realloc call). Previous usage was when we made the builder and did

  SchemaBuilder builder = {
    .list_count = 0,
    .lists = calloc(0, sizeof(SchemaItem*)),
  };

as well as each other call to this method.

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu): "Unless ptr is NULL, it must have been returned by an earlier call to malloc or related functions. If the area pointed to was moved, a free(ptr) is done."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu)

I was referring to the lists of SchemaItem that each SchemaItemList in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One gotcha with using realloc this way -- it has O(n**2) cost, where n is the final list_count. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n) cost. Probably doesn't matter for this example code tho.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I considered increasing more each time, but:
a) it requires tracking an extra capacity param and
b) for example code as you say it's probably overkill

SchemaBuilder *builder = (SchemaBuilder*)data;
char* name_ptr = allocate_name(name);
char* type = malloc(16 * sizeof(char));
sprintf(type, "decimal(%i)(%i)", precision, scale);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always use snprintf! This string could technically be up to 19 bytes long, for example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call to use snprintf, but how would it be 19 bytes? The protocol says precision and scale max out at 38, and "decimal(38)(38)" is 16 bytes.

Copy link
Collaborator

@scovich scovich May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what the spec says, yes... but to avoid buffer overrun risk, we have to handle the biggest i8/u8 values an adversary could throw at us.

ffi/src/lib.rs Outdated
}
child_list_id
}

fn visit_array_item(visitor: &EngineSchemaVisitor, at: &ArrayType) -> usize {
let child_list_id = (visitor.make_field_list)(visitor.data, 1);
visit_schema_item(&at.element_type, "array_data_type", visitor, child_list_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be something like "array_element"? AFAIK it corresponds to an actual column in the parquet?
(ditto for "map_key" and "map_value" below).

We also need to decide how to present these types to the engine, since parquet can have several different physical representations for LIST and MAP types (legacy 2-level vs. 3-level preferred), and we probably want to protect the engine from having to care what the underlying parquet ended up containing -- @roeap knows more about that situation. Either kernel or parquet reader will have to deal with that name mapping problem.

// Creates a new field list, optionally reserving capacity up front
make_field_list: extern "C" fn(data: *mut c_void, reserve: usize) -> usize,
/// opaque state pointer
pub data: *mut c_void,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we require NonNull<c_void>? I guess we don't actually care what the engine does with its own pointer tho?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, like you might want something that just prints out with no need for data, so it could be null.

ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
@nicklan nicklan requested a review from roeap May 6, 2024 22:25
@nicklan nicklan force-pushed the add-most-schema-types-to-ffi branch from f3a27f9 to 0fc9bd7 Compare May 7, 2024 00:03
@nicklan nicklan requested a review from scovich May 8, 2024 20:15
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Bunch of nits to consider before merge, if you want.

*/

// If you want the visitor to print out what it's being asked to do at each step, uncomment the
// following line
// #define PRINT_VISITS
//#define PRINT_VISITS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could also potentially leverage variadic macro args for this:

#ifdef VERBOSE
#define PRINT_VISITS(...) printf(__VA_ARGS__)
#else
#define PRINT_VISIT(...)
#endif

and then just use it:

PRINT_VISIT("Making a list of length %i with id %i\n", reserve, id);

(a bit more work up front, but less #ifdef magic at use sites)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also more fun :) switched to that

ffi/examples/read-table/schema.h Outdated Show resolved Hide resolved
printf("│ ");
}
}
SchemaItem* item = &list->list[i];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably decide whether to use &array[index] vs. array + index and be consistent?

(I personally prefer the former because it's obvious at a glance that it's not normal arithmetic... but I recognize it's also a bit more typing)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense. I tend to use array + index when I know I just want the pointer, and array[index] when I want to use the item there (like array[index].name), but consistency is good. I've updated to all &array[index] style for getting pointers.

printf("Making a list of lenth %i with id %i\n", reserve, id);
#endif
builder->list_count++;
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if there was some valid previous usage, wouldn't we need to free the lists it contained?)

realloc does that for you (afaiu)

I was referring to the lists of SchemaItem that each SchemaItemList in the builder referred to. But it's not a problem as you say, because we're doing realloc to extend rather than replace.

printf("Making a list of lenth %i with id %i\n", reserve, id);
#endif
builder->list_count++;
builder->lists = realloc(builder->lists, sizeof(SchemaItemList) * builder->list_count);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One gotcha with using realloc this way -- it has O(n**2) cost, where n is the final list_count. The usual workaround is to increase the capacity of the array by a fixed ratio like 3/2; in return for potentially 50% wasted space, it is amortized O(n) cost. Probably doesn't matter for this example code tho.

ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated Show resolved Hide resolved
ffi/src/lib.rs Outdated
Comment on lines 795 to 799
DataType::Map(mt) => call!(
visit_map,
mt.value_contains_null,
visit_map_types(visitor, mt)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, what decides whether to have => call!( vs. => {?

Suggested change
DataType::Map(mt) => call!(
visit_map,
mt.value_contains_null,
visit_map_types(visitor, mt)
),
DataType::Map(mt) => {
call!(visit_map, mt.value_contains_null, visit_map_types(visitor, mt))
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not entirely sure, but i think for fmt the trailing ; does, at least if there is no macro involved.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, rustfmt is a bit odd here. It'll let me put the map call into {}s, but won't let it stay on one line.

It seems to use {}s if the body of the match case needs to go on its own line, so

&DataType::STRING => call!(visit_string), is okay because it's all one line, but

DataType::Array(at) =>
    call!(visit_array, at.contains_null, visit_array_item(visitor, at)),

while valid, will be fmt'd to:

DataType::Array(at) => {
    call!(visit_array, at.contains_null, visit_array_item(visitor, at))
}

I've put the map case into {}s for a bit more consistency, but really can't find a formatting that will let it stay on one line, I think because fmt thinks it's just a bit too long.

ffi/src/lib.rs Outdated
/// representation of a schema from a particular schema within kernel.
///
/// The model is list based. When the kernel needs a list, it will ask engine to allocate one of a
/// particular size. Once allocated the engine returns an `id`, which can be any identifier the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: reading "which can be any identifier the engine wants" I thought this would be an opaque type, but IIUC we are assuming usize?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I was thinking that "identifier" was enough to make it clear, but i've amended to "integer identifier ([`usize`])" which is hopefully more clear (and links to the usize docs so implementers can understand what type it will be in c/c++)

Comment on lines +640 to +643
/// - For a map, visit the key and value, passing its special name ("map_key" or "map_value"),
/// type, and value nullability (keys are never nullable)
/// - For a list, visit the element, passing its special name ("array_element"), type, and
/// nullability
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure if it makes that much of a difference, but should be we consistent with the arrow conversion in how we name the special fields?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked about and couldn't find any consistency online about what to name these. Agree that it's not great to have different names here than in our schema.rs code.

In the spirit of not bikeshedding too much I've opened #202 and will follow-up to rename everything properly.

ffi/src/lib.rs Outdated
Comment on lines 795 to 799
DataType::Map(mt) => call!(
visit_map,
mt.value_contains_null,
visit_map_types(visitor, mt)
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not entirely sure, but i think for fmt the trailing ; does, at least if there is no macro involved.

@roeap
Copy link
Collaborator

roeap commented May 10, 2024

Just left a few minor questions. Took me a little bit longer as I took the opportunity to go through most of the ffi code to get familiar.

@nicklan nicklan merged commit 663585d into delta-incubator:main May 10, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants