-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify partition_values field type in Add/Remove actions. #354
Conversation
Thank you @zijie0 ! Looks good from a quick glance, I will take a closer look later today. |
@houqp I'm not sure whether I need to add some write tests. I tried locally changed following test: https://github.com/delta-io/delta-rs/blob/main/rust/tests/write_exploration.rs#L422 with empty string value:
The generated files seems to be the same with "native" delta. |
col_name: k.to_string(), | ||
})? | ||
.clone() | ||
.unwrap_or("__HIVE_DEFAULT_PARTITION__".to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reminded me that we are doing partition filtering wrong in get_files_by_partitions
. We are parsing Action.path to reconstruct the partition values, but instead, we should be using Action.partition_values
or Action.partition_values_parsed
directly.
I will file a separate issue to track this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -150,7 +173,7 @@ pub struct Add { | |||
/// The size of this file in bytes | |||
pub size: DeltaDataTypeLong, | |||
/// A map from partition column to value for this file | |||
pub partition_values: HashMap<String, String>, | |||
pub partition_values: HashMap<String, Option<String>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurred to me that the problem you found has a much bigger impact than just the partition_values field, it also applies to Add.tags, Remove.tags, MetaData.format.options, MetaData.configurations.
Let me know if you prefer to address them or only focus on partition_values in this PR. If you prefer the latter, I can file an issue to track the fix for the other fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey QP, I went through the spec doc but didn't found situations that those structs could also contain null values in the map. Is that because Map[String, String]
allows null value in Scala, so that we also need to make the change to HashMap<String, Option<String>>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, based on the discussion in the dev list, all data types are annotated using scala syntax, so all Map[String, String]
should map to HashMap<String, Option<String>>
in rust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw, I am perfectly fine with us addressing these other fields in an other PR if you want to keep the scope of this PR small. The fix for partition_values itself is already a big improvement :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me take a look on this :)
@zijie0 it would be great if you could help write some write tests to cover the cases of both empty strings and actual null values. So perhaps something like below:
|
rust/tests/write_exploration.rs
Outdated
let files = delta_table.get_files(); | ||
assert_eq!(files.len(), 1); | ||
assert_eq!( | ||
files[0].split("/").next().unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor ask, could we check the partition_values field instead to make sure it's null? delta doesn't care whether we have partition value in the file path or not, so it's perfectly valid to not have modified=__HIVE_DEFAULT_PARTITION_
in the path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I will change the assertion.
The rest looks good to me, thank you @zijie0 for discovering this big can of worms :) |
It is weird, |
7d428e4
to
48c0702
Compare
Sorry, forgot to rebase on main branch... |
Thank you @zijie0 ! |
Description
Modify partition_values to type HashMap<String, Option> according to the delta spec.
Related Issue(s)
Documentation
Detailed discussion here: https://groups.google.com/g/delta-users/c/ZzzrtNRqRyk