Skip to content

Commit

Permalink
fix: avoid some allocations in DeltaStorageHandler (delta-io#1115)
Browse files Browse the repository at this point in the history
# Description

Fixes an error in the `DeltaFileSystemHandler`, when reading file
metadata from remote storages. Due to an inconsistency between the
behaviour object stores when invoking list operations on a path that
points to a file, we incorrectly returned an Directory type for files in
case of object stores. The bug only surfaced when using pyarrow < 9,
since we used the call only when getting the file size, which we avoid
when using more recent pyarrow versions.

@tustvold - I seem to vaguely remember discussing this at some point,
but am not sure anymore. Is this something we should look into in
object-store?

Update: validated locally, that the upstream fixes will fix the linked
issue, so the main reason for this PR is resolved elsewhere. There are
some changes included which safe us some allocation (admittedly very
few), but hopefully an improvement anyhow.

# Related Issue(s)

closes delta-io#1109 

# Documentation

<!---
Share links to useful documentation
--->
  • Loading branch information
roeap authored and chitralverma committed Mar 17, 2023
1 parent e675fea commit d73d94b
Show file tree
Hide file tree
Showing 2 changed files with 21 additions and 8 deletions.
14 changes: 7 additions & 7 deletions python/src/filesystem.rs
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@ impl DeltaFileSystemHandler {
let fs = PyModule::import(py, "pyarrow.fs")?;
let file_types = fs.getattr("FileType")?;

let to_file_info = |loc: String, type_: &PyAny, kwargs: HashMap<&str, i64>| {
let to_file_info = |loc: &str, type_: &PyAny, kwargs: &HashMap<&str, i64>| {
fs.call_method("FileInfo", (loc, type_), Some(kwargs.into_py_dict(py)))
};

Expand All @@ -116,16 +116,16 @@ impl DeltaFileSystemHandler {
("mtime_ns", meta.last_modified.timestamp_nanos()),
]);
infos.push(to_file_info(
meta.location.to_string(),
meta.location.as_ref(),
file_types.getattr("File")?,
kwargs,
&kwargs,
)?);
}
Err(ObjectStoreError::NotFound { .. }) => {
infos.push(to_file_info(
path.to_string(),
path.as_ref(),
file_types.getattr("NotFound")?,
HashMap::new(),
&HashMap::new(),
)?);
}
Err(err) => {
Expand All @@ -134,9 +134,9 @@ impl DeltaFileSystemHandler {
}
} else {
infos.push(to_file_info(
path.to_string(),
path.as_ref(),
file_types.getattr("Directory")?,
HashMap::new(),
&HashMap::new(),
)?);
}
}
Expand Down
15 changes: 14 additions & 1 deletion python/tests/test_fs.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from pyarrow.fs import S3FileSystem
from pyarrow.fs import FileType

from deltalake import DeltaTable, PyDeltaTableError
from deltalake.fs import DeltaStorageHandler
Expand All @@ -28,6 +28,19 @@ def test_read_files(s3_localstack):
assert table.shape > (0, 0)


@pytest.mark.s3
@pytest.mark.integration
@pytest.mark.timeout(timeout=15, method="thread")
def test_read_file_info(s3_localstack):
table_path = "s3://deltars/simple"
handler = DeltaStorageHandler(table_path)
meta = handler.get_file_info(
["part-00000-a72b1fb3-f2df-41fe-a8f0-e65b746382dd-c000.snappy.parquet"]
)
assert len(meta) == 1
assert meta[0].type == FileType.File


@pytest.mark.s3
@pytest.mark.integration
@pytest.mark.timeout(timeout=15, method="thread")
Expand Down

0 comments on commit d73d94b

Please sign in to comment.