
apply a schema to fix column names #331

Open · wants to merge 9 commits into main

Conversation

@nicklan (Collaborator) commented Sep 10, 2024

Enforce/apply the schema given when we evaluate an expression.

Given that we want to move to expression-based fix-up and let the final schema dictate the output, we will need to do this.

This code fixes things up at all levels of the output, which is messy in Arrow since schemas are embedded all over the place. The schema is only applied if the output of the expression doesn't exactly match the passed schema.

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 16.24365% with 165 lines in your changes missing coverage. Please review.

Project coverage is 75.54%. Comparing base (1e19980) to head (29b09f7).

Files with missing lines               Patch %   Lines
kernel/src/engine/arrow_expression.rs    3.59%   133 Missing and 1 partial ⚠️
kernel/src/scan/mod.rs                  44.44%   26 Missing and 4 partials ⚠️
kernel/src/engine/arrow_utils.rs        66.66%   0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #331      +/-   ##
==========================================
- Coverage   76.85%   75.54%   -1.32%     
==========================================
  Files          47       47              
  Lines        9436     9620     +184     
==========================================
+ Hits         7252     7267      +15     
- Misses       1789     1956     +167     
- Partials      395      397       +2     


@scovich (Collaborator) left a comment:

Initial pass. Interested to understand better how this fits into the broader column mapping situation etc.

(Two outdated review threads on kernel/src/engine/arrow_expression.rs were marked resolved.)
// build up a new set of (col, field) pairs
let result_iter = arrow_fields.into_iter().zip(schema.fields()).zip(cols).map(
|((arrow_field, kernel_field), col)| -> DeltaResult<(Arc<dyn Array>, ArrowField)> {
match (&kernel_field.data_type, arrow_field.data_type()) {
Collaborator:

Do we also need to check col.data_type()? It should exactly match arrow_field.data_type(), right?

Collaborator:

Update: I guess the as_struct_opt call, combined with recursion on the child fields, would validate that as a side effect?

Collaborator (author):

The new code pulls the data types from the columns as it recurses, so I think this is covered now.

},
);
let (new_cols, new_fields): (Vec<Arc<dyn Array>>, Vec<ArrowField>) =
result_iter.process_results(|iter| iter.unzip())?;
Collaborator:

Trying to grok this process_results thing...

  • It takes an Iterator<DeltaResult<(Arc<dyn Array>, Field)>> as input?
  • And then, it specially maps that iterator so that:
    • Any Err result from the input iterator immediately triggers an error result of the overall operation?
    • The mapping function receives an iterator of all the Ok values, which can be manipulated however we want
  • Similar to collect, type inference can produce an output of any type (not just Iterator)
  • The output is thus DeltaResult<(Vec<_>, Vec<_>)>?

So basically, this is a collect that allows transformation of the iterator before materializing it, but with the same error-stop semantics?

(phew, that's complicated)

Collaborator:

I think this is functionally equivalent to collect .. iter .. unzip .. collect?
But more efficient because it doesn't have to materialize the intermediate result?

Put another way: It's like Result.and_then, but for an iterator with fail-stop semantics?

Collaborator (author):

Yep. I basically think of it as: if I have a function that takes Iter<T> and I want to apply it over an iterator of Result<T, E>, then I can call process_results with the function.

process_results takes care of unwrapping the Ok values, calls the process function on the inner values, and wraps the final result in an Ok. Otherwise it just stops at the first Err.

I think this is functionally equivalent to collect .. iter .. unzip .. collect?

Almost, without the final collect, since unzip is basically two collects in a trench-coat.
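
To make those semantics concrete, here's a minimal standalone sketch using the Itertools::process_results method from the itertools crate (the item and error types are made up for illustration):

use itertools::Itertools;

fn main() {
    // An iterator of Results, standing in for fallible per-item work.
    let items = vec![Ok((1, "a")), Ok((2, "b")), Ok((3, "c"))];

    // The closure sees an iterator over just the Ok values, so we can unzip
    // straight into two Vecs without materializing a Vec of pairs first.
    let unzipped: Result<(Vec<i32>, Vec<&str>), String> =
        items.into_iter().process_results(|iter| iter.unzip());
    assert_eq!(unzipped, Ok((vec![1, 2, 3], vec!["a", "b", "c"])));

    // Any Err short-circuits the whole operation, fail-stop style.
    let bad = vec![Ok((1, "a")), Err("boom".to_string())];
    let result: Result<(Vec<i32>, Vec<&str>), String> =
        bad.into_iter().process_results(|iter| iter.unzip());
    assert_eq!(result, Err("boom".to_string()));
}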

@@ -171,7 +171,7 @@ impl Error {
         Self::FileNotFound(path.to_string())
     }
     pub fn missing_column(name: impl ToString) -> Self {
-        Self::MissingColumn(name.to_string())
+        Self::MissingColumn(name.to_string()).with_backtrace()
Collaborator:

temporary debugging aid? or intentional?

Collaborator (author):

Intentional. Since this happens in multiple places, it's useful to have a backtrace.

@nicklan nicklan marked this pull request as ready for review October 8, 2024 23:33
@nicklan nicklan requested a review from scovich October 8, 2024 23:35
@@ -53,8 +53,8 @@ pub(crate) use prim_array_cmp;
 /// returns a tuples of (mask_indicies: Vec<parquet_schema_index>, reorder_indicies:
 /// Vec<requested_index>). `mask_indicies` is used for generating the mask for reading from the

-fn make_arrow_error(s: String) -> Error {
+pub(crate) fn make_arrow_error(s: String) -> Error {
     Error::Arrow(arrow_schema::ArrowError::InvalidArgumentError(s))
Collaborator:

Suggested change
-pub(crate) fn make_arrow_error(s: String) -> Error {
+pub(crate) fn make_arrow_error(s: impl Into<String>) -> Error {

^^ This simplifies call sites that pass &str, without duplicating an arg that was already a String (because Into::into consumes its argument).

One could also use s: impl ToString -- which accepts more types as input -- but String::to_string must always make a copy because it takes &self instead of self.
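
A tiny sketch of that trade-off (the function names here are illustrative, not kernel APIs):

// impl Into<String>: an owned String is moved in (no copy); a &str is copied once.
fn make_error_into(s: impl Into<String>) -> String {
    s.into()
}

// impl ToString: accepts anything with Display, but String::to_string takes
// &self, so even an owned String gets copied.
fn make_error_to_string(s: impl ToString) -> String {
    s.to_string()
}

fn main() {
    let owned = String::from("already owned");
    let _ = make_error_into(owned);       // moved, no new allocation
    let _ = make_error_into("borrowed");  // copied once into a String
    let _ = make_error_to_string(42);     // Display types work too
}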

Comment on lines +381 to +382
// make column `col` with type `arrow_type` look like `kernel_type`. For now this only handles name
// transforms. if the actual data types don't match, this will return an error
Collaborator:

What about nullability? IIRC we have to read parquet with everything nullable, because parquet can't express the concept of a non-nullable field nesting inside a nullable field.

Or did we handle that already by just making everything nullable in our action schema?

) -> DeltaResult<Option<Arc<dyn Array>>> {
match (kernel_type, arrow_type) {
(DataType::Struct(kernel_fields), ArrowDataType::Struct(arrow_fields)) => {
if kernel_fields.fields.len() != arrow_fields.len() {
Collaborator:

Many query engines (including both spark and arrow) allow schema widening, where fields present only in the target schema are allowed, and inferred to be null. Fields present only in source, or fields present in both but with incompatible types, are still forbidden.

So the question: Do we anticipate such capability being useful or even necessary, given expected kernel use cases? If so, we may want to consider supporting it (at least, enough to ensure that our implementation can be extended to support it later). If not needed, we should avoid adding yet more complexity.

Collaborator:

A similar question: Do we anticipate it being useful to support dropping fields that are present only in the source?

Asking because nested column pruning is actually quite difficult to express, without some capability like this.

But on the other hand, it seems like we should prefer to do all column pruning -- nested or otherwise -- up front (before reading) so we don't have to materialize unwanted data.
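
For reference, a rough sketch of what the widening semantics described above could look like in arrow-rs. The helper name and error handling are hypothetical, not this PR's code: target-only nullable fields are filled via new_null_array, type mismatches are rejected, and source-only fields are silently dropped here (a stricter version would reject them).

use std::sync::Arc;

use arrow_array::{new_null_array, ArrayRef, RecordBatch};
use arrow_schema::Schema;

// Hypothetical sketch: widen `batch` to `target`.
fn widen(batch: &RecordBatch, target: &Schema) -> Result<RecordBatch, String> {
    let source = batch.schema();
    let mut cols: Vec<ArrayRef> = Vec::with_capacity(target.fields().len());
    for field in target.fields() {
        match source.column_with_name(field.name()) {
            Some((idx, src_field)) => {
                if src_field.data_type() != field.data_type() {
                    return Err(format!("incompatible type for {}", field.name()));
                }
                cols.push(batch.column(idx).clone());
            }
            // Field present only in the target: inferred to be null, so it
            // must be nullable.
            None if field.is_nullable() => {
                cols.push(new_null_array(field.data_type(), batch.num_rows()));
            }
            None => return Err(format!("missing non-nullable field {}", field.name())),
        }
    }
    RecordBatch::try_new(Arc::new(target.clone()), cols).map_err(|e| e.to_string())
}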

Comment on lines +397 to +399
let sa = col.as_struct_opt().ok_or(make_arrow_error(
"Arrow claimed to be a struct but isn't a StructArray".to_string(),
))?;
Collaborator:

Suggested change
-let sa = col.as_struct_opt().ok_or(make_arrow_error(
-    "Arrow claimed to be a struct but isn't a StructArray".to_string(),
-))?;
+let sa = col.as_struct_opt().ok_or_else(|| make_arrow_error(
+    "Arrow claimed to be a struct but isn't a StructArray",
+))?;

(we don't want to create the error unless it's needed)

(many others to fix as well)

(we should fix make_arrow_error to take impl Into<String> instead of String)
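
The laziness point, as a standalone sketch with a visible side effect:

fn expensive_error() -> String {
    println!("building error message"); // side effect, just for the demo
    "something went wrong".to_string()
}

fn main() {
    let present: Option<i32> = Some(5);

    // ok_or evaluates its argument eagerly: the message is built and then
    // thrown away, even though the Option is Some.
    let _ = present.ok_or(expensive_error()); // prints

    // ok_or_else takes a closure, so nothing is built on the happy path.
    let _ = present.ok_or_else(expensive_error); // does not print
}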

Collaborator:

Alternatively, I recently learned a neat rust trick:

Suggested change
-let sa = col.as_struct_opt().ok_or(make_arrow_error(
-    "Arrow claimed to be a struct but isn't a StructArray".to_string(),
-))?;
+let Some(sa) = col.as_struct_opt() else {
+    return Err(make_arrow_error("Arrow claimed to be a struct but isn't a StructArray"));
+};

It works just like if let, but the else clause must diverge (panic, return, break, etc).

It may not be a net win in this specific case tho.
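
A minimal standalone example of the let-else form:

fn first_word(s: &str) -> Result<&str, String> {
    // Bind the pattern, or diverge in the else branch (a return, here).
    let Some(word) = s.split_whitespace().next() else {
        return Err("empty input".to_string());
    };
    Ok(word)
}

fn main() {
    assert_eq!(first_word("hello world"), Ok("hello"));
    assert!(first_word("   ").is_err());
}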

let sa = col.as_struct_opt().ok_or(make_arrow_error(
"Arrow claimed to be a struct but isn't a StructArray".to_string(),
))?;
let (fields, sa_cols, sa_nulls) = sa.clone().into_parts();
Collaborator:

It seems unfortunate that we can't consume sa... but I guess it's all Arc all the way down, so hopefully the cloning is at least cheap?
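
A small sketch of why that clone should indeed be cheap: a StructArray's children and buffers are behind Arcs, so clone() bumps refcounts rather than copying row data. (Standalone arrow-rs example, not this PR's code.)

use std::sync::Arc;

use arrow_array::{Array, ArrayRef, Int32Array, StructArray};
use arrow_schema::{DataType, Field};

fn main() {
    let child: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let field = Arc::new(Field::new("x", DataType::Int32, true));
    let sa = StructArray::from(vec![(field, child)]);

    // clone() only bumps Arc refcounts; into_parts then hands back owned
    // fields, child columns, and the optional null buffer.
    let (fields, columns, nulls) = sa.clone().into_parts();
    assert_eq!(fields.len(), 1);
    assert_eq!(columns[0].len(), 3);
    assert!(nulls.is_none());
}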

Comment on lines +412 to +414
fn make_data_type_physical(
logical_dt: &DataType,
column_mapping_mode: ColumnMappingMode,
Collaborator:

FYI: Normally, the field's field id and physical name are stored in the field's metadata... but Iceberg requires field ids even for the internal columns used by Map and Array, and there's no way to associate metadata with those. So, when IcebergCompatV2 table feature is enabled, we have to remember the most recently-seen field, as well as the column path we descended through since then, so we can fetch the field ids out of that parent field's metadata.

Collaborator:

Actually, I don't think Delta cares about those iceberg field ids -- even in column mapping field mode -- so maybe we can ignore all of this on the read path.

// build up the mapped child fields
let children = struct_type
.fields()
.map(|field| make_field_physical(field, column_mapping_mode))
Collaborator:

Somewhere near here, we need to actually fetch the field's physical name from the field metadata, no?

Collaborator:

Ah make_field_physical != make_data_type_physical. And it does the right thing.

.fields()
.map(|field| make_field_physical(field, column_mapping_mode))
.try_collect()?;
Ok(DataType::Struct(Box::new(StructType::new(children))))
Collaborator:

Suggested change
-Ok(DataType::Struct(Box::new(StructType::new(children))))
+Ok(DataType::struct_type(children))

Comment on lines +399 to +404
Ok(StructField {
name: physical_name.to_string(),
data_type: mapped_data_type,
nullable: logical_field.nullable,
metadata: logical_field.metadata.clone(),
})
Collaborator:

Is this better or worse?

Suggested change
-Ok(StructField {
-    name: physical_name.to_string(),
-    data_type: mapped_data_type,
-    nullable: logical_field.nullable,
-    metadata: logical_field.metadata.clone(),
-})
+let mut field = StructField::new(physical_name, mapped_data_type, logical_field.nullable);
+field.metadata = logical_field.metadata.clone();
+Ok(field)

Comment on lines +389 to +396
fn make_field_physical(
logical_field: &StructField,
column_mapping_mode: ColumnMappingMode,
) -> DeltaResult<StructField> {
match column_mapping_mode {
ColumnMappingMode::None => Ok(logical_field.clone()),
ColumnMappingMode::Name => {
let physical_name = logical_field.physical_name(column_mapping_mode)?;
Collaborator:

Column mapping is pretty fundamental... I wonder if we should define these as methods on the various types?

impl StructField {
    pub fn to_physical(self, mode: ColumnMappingMode) -> DeltaResult<Self> {
        Ok(Self {
            name: self.physical_name(mode)?.into(),
            data_type: self.data_type.to_physical(mode)?,
            ..self
        })
    }
}

impl DataType {
    pub fn to_physical(self, mode: ColumnMappingMode) -> DeltaResult<Self> {
        use DataType::*;
        let result = match self {
            Struct(s) => DataType::try_struct_type(
                s.fields.into_values().map(|field| field.to_physical(mode)),
            )?,
            Array(a) => DataType::array_type(
                a.element_type.to_physical(mode)?,
                a.contains_null,
            ),
            Map(m) => DataType::map_type(
                m.key_type.to_physical(mode)?,
                m.value_type.to_physical(mode)?,
                m.value_contains_null,
            ),
            Primitive(_) => self,
        };
        Ok(result)
    }
}

Note that the above consumes self so we would need to clone the input first. Seems like a welcome simplification that doesn't change big-O bound and so satisfies the "don't optimize, don't pessimize" mantra?

Meanwhile: the existing StructField::with_field takes &self instead of self (the usual convention for with_XXX methods). It only has one call site so far (ignoring this PR), so we should probably fix that ASAP.

Also, the existing StructType::new takes Vec<StructField> when it would be much nicer to take impl IntoIterator<Item = StructField>. I'll probably throw up a PR to fix that one soon.

Also, I noticed that the test_field_metadata unit test (schema.rs) for physical names is going directly to the metadata instead of exercising the physical_name method we actually use in practice??
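
For reference, the with_XXX convention alluded to above, as a tiny hypothetical example (take self by value so chained calls move rather than clone):

#[derive(Debug)]
struct Options {
    retries: u32,
    verbose: bool,
}

impl Options {
    fn new() -> Self {
        Self { retries: 0, verbose: false }
    }
    // Consuming self lets callers chain without borrowing or cloning.
    fn with_retries(self, retries: u32) -> Self {
        Self { retries, ..self }
    }
    fn with_verbose(self, verbose: bool) -> Self {
        Self { verbose, ..self }
    }
}

fn main() {
    let opts = Options::new().with_retries(3).with_verbose(true);
    println!("{opts:?}");
}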

Collaborator:

PR to simplify StructType is up: #385
