
Truncated file path returned from self._table.dataset_partitions in 0.6.2 deltalake #880

Closed
george-zubrienko opened this issue Oct 11, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@george-zubrienko
Contributor

george-zubrienko commented Oct 11, 2022

Environment: WSL2 (Ubuntu 20.04)

Delta-rs version: 0.6.2

Binding: Python (3.9)

Environment:

  • Cloud provider: -
  • OS: Ubuntu 20.04 / Windows 10
  • Other: -

Bug

What happened:

Observing different output from 0.6.1 and 0.6.2 for the same call (sorry, I had different venvs on different OSes while investigating this problem):

from deltalake import DeltaTable
pds = DeltaTable('/mnt/c/users/gzu/source/repos/github/proteus/tests/delta_table', version=None, storage_options=None)
pds._table.dataset_partitions(None, pds.pyarrow_schema())

On Ubuntu, 0.6.2 gives:

[
   (
      'part-00000-7d712efc-8cc5-42ce-a5ac-46887e04ee94-c000.snappy.parquet', 
      <pyarrow.compute.Expression ((((((B >= "1") and (A >= "a")) and (A <= "a")) and (B <= "1")) and is_valid(B)) and is_valid(A))>
    ), 
    (
       'part-00001-e4bd0782-01a5-4ad7-945b-45985e006430-c000.snappy.parquet'
...

On Windows, 0.6.1 gives:

[
   (
      'C:\\Users\\GZU\\source\\repos\\github\\proteus\\tests/delta_table\\part-00000-7d712efc-8cc5-42ce-a5ac-46887e04ee94-c000.snappy.parquet', 
      <pyarrow.compute.Expression ((((((A >= "a") and (B >= "1")) and (B <= "1")) and (A <= "a")) and is_valid(B)) and is_valid(A))>), 
   (
      'C:\\Users\\GZU\\source\\repos\\github\\proteus\\tests/delta_table\\part-00001-e4bd0782-01a5-4ad7-945b-45985e006430-c000.snappy.parquet', 
...

This causes a problem when supplying a filesystem to this method:

p_ds = DeltaTable('/my/table').to_pyarrow_dataset(filesystem=PyFileSystem(FSSpecHandler(LocalFileSystem())))

This call works on Windows but not on Ubuntu, because on Ubuntu only the file name part of the path is returned, as shown above.

The interesting part is that if I omit the filesystem parameter, everything works on both OSes.
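A minimal stdlib sketch of what seems to be going on (hypothetical file name; the real resolution happens inside deltalake/pyarrow): in 0.6.2 the bare file name is handed to the user-supplied filesystem, which resolves it against its own root instead of the table root, while without the parameter deltalake resolves it internally.

```python
import posixpath

table_root = "/my/table"
# 0.6.2 returns only the bare file name from dataset_partitions
bare_name = "part-00000-c000.snappy.parquet"  # hypothetical name

# A filesystem rooted elsewhere looks the bare name up relative to its
# own root and misses the file; joining it back onto the table root
# restores the full path that 0.6.1 used to return.
full_path = posixpath.join(table_root, bare_name)
print(full_path)  # → /my/table/part-00000-c000.snappy.parquet
```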

What you expected to happen:

Same behaviour on Windows and Linux for the described case.

How to reproduce it:

Run

p_ds = DeltaTable('/my/table').to_pyarrow_dataset(filesystem=PyFileSystem(FSSpecHandler(LocalFileSystem())))

on any Linux distro and observe an error similar to:

proteus/storage/delta_lake/_functions.py:49: in load
    pyarrow_table: Table = pyarrow_ds.to_table(filter=row_filter, columns=columns)
pyarrow/_dataset.pyx:331: in pyarrow._dataset.Dataset.to_table
    ???
pyarrow/_dataset.pyx:2577: in pyarrow._dataset.Scanner.to_table
    ???
pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    ???
pyarrow/_fs.pyx:1544: in pyarrow._fs._cb_open_input_file
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pyarrow.fs.FSSpecHandler object at 0x7fa8d2318370>, path = 'part-00000-7d712efc-8cc5-42ce-a5ac-46887e04ee94-c000.snappy.parquet'

    def open_input_file(self, path):
        from pyarrow import PythonFile

        if not self.fs.isfile(path):
>           raise FileNotFoundError(path)
E           FileNotFoundError: part-00000-7d712efc-8cc5-42ce-a5ac-46887e04ee94-c000.snappy.parquet

More details:
Issue only occurs in 0.6.2 version.

@george-zubrienko
Contributor Author

Okay, I've investigated this and it seems to be a generic issue not related to Windows: I was running the 0.6.1 version in my Windows test env.

@george-zubrienko george-zubrienko changed the title Truncated file path on Linux filesystem in self._table.dataset_partitions Truncated file path returned from self._table.dataset_partitions in 0.6.2 deltalake Oct 17, 2022
@george-zubrienko
Contributor Author

george-zubrienko commented Oct 17, 2022

This change should fix it:

    pub fn get_files_iter(&self) -> impl Iterator<Item = Path> + '_ {
        self.state
            .files()
            .iter()
            .map(|add| Path::from_iter([self.table_uri(), add.path.to_string()]))
    }

Testing and submitting a PR.

@george-zubrienko
Contributor Author

Alright, I finally figured this out. Apparently all tests in deltalake now make use of SubTreeFileSystem, which was the missing piece of this puzzle.

So, using this custom FS implementation will work:

    def get_pyarrow_filesystem(self, path: DataPath) -> PyFileSystem:
        connection_options = self.connect_storage(path=path)
        file_system = AzureBlobFileSystem(
            account_name=connection_options['AZURE_STORAGE_ACCOUNT_NAME'],
            account_key=connection_options['AZURE_STORAGE_ACCOUNT_KEY']
        )

        return SubTreeFileSystem(path.to_hdfs_path(), PyFileSystem(FSSpecHandler(file_system)))
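Why SubTreeFileSystem fixes this: it rewrites every path against a fixed base path before delegating to the wrapped filesystem, so the bare file names returned in 0.6.2 resolve under the table root. A toy stdlib model of that behaviour (hypothetical class and names, not pyarrow's actual implementation):

```python
import posixpath

class SubTreeSketch:
    """Toy model of pyarrow's SubTreeFileSystem: relative paths are
    resolved against a fixed base path before delegation."""

    def __init__(self, base_path, inner_open):
        self.base_path = base_path.rstrip("/")
        self.inner_open = inner_open  # callable that receives the full path

    def open_input_file(self, path):
        return self.inner_open(posixpath.join(self.base_path, path))

# With the base set to the table root, a bare 0.6.2 file name is
# expanded to the full path before the inner filesystem sees it:
fs = SubTreeSketch("/my/table", inner_open=lambda p: p)
print(fs.open_input_file("part-00000-c000.snappy.parquet"))
# → /my/table/part-00000-c000.snappy.parquet
```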

Closing the issue.

@roeap
Collaborator

roeap commented Oct 17, 2022

Sorry for being a bit late, since you have already figured all of this out, but there is some documentation on this:

https://delta-io.github.io/delta-rs/python/usage.html#custom-storage-backends
