Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add function & test for parsing table_or_uri #1138

Merged
merged 6 commits into from
Feb 12, 2023

Conversation

marijncv
Copy link
Contributor

Description

Adds a more specific ValueError if a wrong table_or_uri is provided in the write_delta_lake function. It also adds support for a pathlib Path object.

Related Issue(s)

Documentation

@github-actions github-actions bot added the binding/python Issues for the Python package label Feb 12, 2023
Signed-off-by: Marijn Valk <marijncv@hotmail.com>
Signed-off-by: Marijn Valk <marijncv@hotmail.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 12, 2023
Signed-off-by: Marijn Valk <marijncv@hotmail.com>
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this. Would you be wiling to add Path support to DeltaTable too? Then we can clean up most of the unit tests that have to do the Path to str conversion.

python/tests/test_writer.py Outdated Show resolved Hide resolved
data = pa.table({"vals": pa.array(["1", "2", "3"])})
table_or_uri = str(tmp_path / "delta_table")
write_deltalake(table_or_uri, data)
delta_table = DeltaTable(table_or_uri)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. We need DeltaTable.__init__ to support Path too. Would you be willing to add that as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it there! And also in try_get_deltatable

Comment on lines 331 to 332
# Non-existant local paths are only accepted as fully-qualified URIs
table_uri = "file://" + str(Path(table_or_uri).absolute())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by now we actually extended the path handling on the rust side to also support relative paths and scheme-less paths, and also create a directory if it does not yet exists.

pub(crate) fn ensure_table_uri(table_uri: impl AsRef<str>) -> DeltaResult<Url> {
let table_uri = table_uri.as_ref();
if let Ok(path) = std::fs::canonicalize(table_uri) {
return Url::from_directory_path(path)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()));
}
if let Ok(url) = Url::parse(table_uri) {
return Ok(match url.scheme() {
"file" => url,
_ => {
let mut new_url = url.clone();
new_url.set_path(url.path().trim_end_matches('/'));
new_url
}
});
}
// The table uri still might be a relative paths that does not exist.
std::fs::create_dir_all(table_uri)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))?;
let path = std::fs::canonicalize(table_uri)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))?;
Url::from_directory_path(path)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()))
}

Is it maybe a good idea to rely on the same logic on the python side as well, so we are always consistent in the supported paths?

Of course supporting Path objects will always be python territory :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer! That makes the python code a bit simpler

) -> Tuple[Optional[DeltaTable], str]:
"""Parses `table_or_uri` and returns `table` & `table_uri`.

Raises a ValueError if `table_or_uri` is not of type `str`, `Path` or `DeltaTable`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe write this in numpy docstring format? https://numpydoc.readthedocs.io/en/latest/format.html#raises

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very sorry, but I said something wrong. We are not using the numpy format... we are using the reST format. not sure how raises are done there. @wjones127 - do you know? I could not find good docs for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, the format shown here looks very similar to what is used in other places in the codebase and it also specifies a way to describe a raise. I can rewrite it in that format if it's the correct one to use

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the format. Could you rewrite that?

Signed-off-by: Marijn Valk <marijncv@hotmail.com>
Signed-off-by: Marijn Valk <marijncv@hotmail.com>
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is a great improvement :)

@wjones127 wjones127 merged commit fab36eb into delta-io:main Feb 12, 2023
@marijncv marijncv deleted the parsing-table-or-uri-in-dt-writer branch February 13, 2023 06:43
wjones127 pushed a commit that referenced this pull request Feb 14, 2023
Signed-off-by: Marijn Valk <marijncv@hotmail.com>

# Description
Now that #1138 is merged, we can use `pathlib.Path` instead of `str` in
`write_deltalake()` and `DeltaTable.__init__()`

---------

Signed-off-by: Marijn Valk <marijncv@hotmail.com>
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
# Description
Adds a more specific ValueError if a wrong `table_or_uri` is provided in
the `write_delta_lake` function. It also adds support for a pathlib
`Path` object.

# Related Issue(s)
- closes delta-io#1123 

# Documentation

<!---
Share links to useful documentation
--->

---------

Signed-off-by: Marijn Valk <marijncv@hotmail.com>
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
Signed-off-by: Marijn Valk <marijncv@hotmail.com>

# Description
Now that delta-io#1138 is merged, we can use `pathlib.Path` instead of `str` in
`write_deltalake()` and `DeltaTable.__init__()`

---------

Signed-off-by: Marijn Valk <marijncv@hotmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Python: mixing up parameters to write_deltalake gives confusing error
3 participants