Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: casting when data to be written does not match table schema #1427

Merged
merged 4 commits into from
Jun 3, 2023

Conversation

Blajda
Copy link
Collaborator

@Blajda Blajda commented Jun 3, 2023

Description

Suppose a user has a table with column of type int. A user can create a record batch with type Uft8 and write the value to table. My expectation is that either the writer returns an error or ansi sql behavior is implemented where non-numeric strings are turned into nulls.

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Jun 3, 2023
@Blajda
Copy link
Collaborator Author

Blajda commented Jun 3, 2023

Hi @wjones127,
I discovered the following while implementing the update operation.
I have a test where a user attempts to update a column with a type that does not match the table schema. I thought the writer would return an error but it wrote the data as is to the table. It will then 'corrupt' the table since further operations on it using Datafusion will error since the datatypes are incompatible.

Seems like this breaks the invariant test since the expected underlying type for integer is i32 but the array provided is u32. Hence I think we should update get_record_batch to provide an i32 array.

Let me know what you think

@Blajda Blajda changed the title fix: fail when data to be written does not match table schema fix: casting when data to be written does not match table schema Jun 3, 2023
@Blajda Blajda marked this pull request as ready for review June 3, 2023 15:06
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore the integration test failure 😒

final link failed: No space left on device

These GitHub Actions caches are...annoying

let batch = RecordBatch::try_new(
Arc::clone(&schema),
vec![Arc::new(StringArray::from(vec![
Some("Test123".to_owned()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, where does this value go in the expected table? 😕

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right below, in line 570?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it cannot be parsed as an int it will result in a null value. Which aligns with ansi sql but maybe we should only allow that if the user opts in?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nevermind. It seems like this is being inserted as null? That doesn't seem like what a user would want.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we could disable that by passing safe: true in CastOptions: https://docs.rs/arrow-cast/40.0.0/arrow_cast/cast/struct.CastOptions.html

@@ -704,17 +705,6 @@ mod tests {
table
}

async fn get_data(table: DeltaTable) -> Vec<RecordBatch> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good refactor 👍

@wjones127
Copy link
Collaborator

I can't find an arrow function that does more strict casting, but I kind of would prefer if we were more strict with the casting. For example, these casts aren't lossy:

  • Dictionary<int8, string> -> string
  • int32 -> int64

But I think we should reject these by default

  • string -> int32
  • timestamp -> string

@wjones127
Copy link
Collaborator

I think for now, I'd be fine with casting if we added safe: true by default. If you want to add an option to allow unsafe casting to null, I think that's fine.

@Blajda
Copy link
Collaborator Author

Blajda commented Jun 3, 2023

The name is confusing but the default behavior we want is safe: false which will return an error. safe: true will write a null value.
I expanded the test to include the timestamp case however it does write the timestamp in the RFC format without issue. Should we accept this?

@wjones127
Copy link
Collaborator

Should we accept this?

I think this is an improvement.

If we want to be more strict, we should write a function that checks the schema the user passed and compare it to the table, returning an error if it doesn't logically match. But I think that's fine to do as a follow up.

@roeap roeap merged commit 98d9dcc into delta-io:main Jun 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants