Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

files_by_partitions() changed the return type from absolute to relative paths in 0.6.2 #894

Closed
jeisinge opened this issue Oct 19, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@jeisinge
Copy link

Environment

Delta-rs version: 0.6.2

Binding: python

Environment:

  • Cloud provider: Databricks
  • OS: Ubuntu
  • Other:

Bug

What happened:
In previous versions <= 0.6.1, files_by_partitions returned the absolute path of the files. However, in 0.6.2, the relative path is returned.

What you expected to happen:
Minor version upgrades will not break API.

How to reproduce it:

  delta_lake_path = "/tmp/foo"
  delta_table = DeltaTable(delta_lake_path)
  parquet_files = delta_table.files_by_partitions(partition_filters)

In <= 0.61, parquet_files is an absolute path. In 0.6.2, it is relative to delta_lake_path.

More details:
Unclear if this has been resolved per #880 .

It sounds like https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends could be used. However, this Delta Lake is on a mount, so it is unclear if this is needed as it appears as a local file to the system. More importantly, it was unexpected for the API to change between minor versions.

It is also worth noting that https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends does not specify that the path will be absolute or relative.

@jeisinge jeisinge added the bug Something isn't working label Oct 19, 2022
@wjones127
Copy link
Collaborator

Thank you for pointing this out; we should not have changed that API.

@vsytchenko-exos
Copy link

Is there any news regarding the issue? Any plans to fix it?

@wjones127
Copy link
Collaborator

Sorry haven't gotten around to this, but I will look at this soon. IMO we have too many methods for getting files right now, and I'd like to par them down. Here's what I'm thinking:

  • DeltaTable.files() -> returns relative paths. Add optional partition_filters argument.
  • DeltaTable.file_uris() -> returns absolute URI. Add optional partition_filters argument.
  • DeltaTable.files_by_partition() -> bring back to absolute paths, but also deprecate in favor of file_uris().
  • DeltaTable.file_paths() -> delete. Has already been deprecated for a while.

fvaleye pushed a commit that referenced this issue Dec 27, 2022
# Description

This PR consolidates the four methods `files()`, `file_paths()`,
`file_uris()`, `files_by_partitions()` into just two methods:

* `files()` -> which returns paths as they are in the Delta Log (usually
relative, but *can* be absolute, particularly if they are located
outside of the delta table root).
 * `file_uris()`, which returns absolute URIs for all files.

Both of these now take the `partition_filters` parameter, making
`files_by_partitions()` obsolete. That latter function has been marked
deprecated, but it also returns it to its original behavior of returning
absolute file paths and not relative ones, resolving #894.

Finally, the `partition_filters` parameter now supports passing values
other than strings, such as integers and floats.

TODO:

 * [x] Update documentation
* [ ] ~~Test behavior of filtering for null or non-null~~ Null handling
isn't supported by DNF filters IIUC
 * [x] Test behavior of paths on object stores.

# Related Issue(s)
<!---
For example:

- closes #106
--->

# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants