Skip to content

Commit

Permalink
feat: globbing with fsspec (#1061)
Browse files Browse the repository at this point in the history
* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* feat: set `fsspec` (`s3fs`) as default handler for s3 paths (#1032)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* set version to 5.2.0rc1 (release candidate)

* set s3fs as default for s3

* test different handlers

* correct serialization of fsspec source

* feat: simplify object path split (#1028)

* simplify object path split

* add example from #975

* fix tests

* add more test cases

* test case update

* remove scheme unused regex

* feat: fsspec for all non-object writing - %-encoded urls no longer decoded (#1034)

* writing goes through fsspec

* increase rc version

* type hints and docs

* add helper methods, create

* throw more specific error

* add additional test for `create` failure with scheme other than local

* simplify source selection

* remove windows specific code

* raise exception if invalid combination of handler / input (file-like object and fsspec)

* use softer check for file-like object

* cover problematic case with additional slash (file:///c:/file.root)

* test "file:" scheme (no slash)

* test backslash

* test: improve path object split tests (#1039)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* feat: set `fsspec` (`s3fs`) as default handler for s3 paths (#1032)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* set version to 5.2.0rc1 (release candidate)

* set s3fs as default for s3

* test different handlers

* correct serialization of fsspec source

* feat: simplify object path split (#1028)

* simplify object path split

* add example from #975

* fix tests

* add more test cases

* test case update

* remove scheme unused regex

* feat: fsspec for all non-object writing - %-encoded urls no longer decoded (#1034)

* writing goes through fsspec

* increase rc version

* type hints and docs

* add helper methods, create

* throw more specific error

* add additional test for `create` failure with scheme other than local

* simplify source selection

* remove windows specific code

* raise exception if invalid combination of handler / input (file-like object and fsspec)

* use softer check for file-like object

* cover problematic case with additional slash (file:///c:/file.root)

* test "file:" scheme (no slash)

* test backslash

* add new test case

* split big test in two

* retry on socket error

* xrootd iterator

* iterate over different files

* iterate over tree

* pytest fixture for test directory

* pytest fixture for test directory

* add annotation to open argument

* remove repeated test

* test: add test for issue 1054 (newer fsspec failing to parse files with colons in name) (#1055)

* add test for issue 1054

* additional test

* make sure fsspec fix works

* try new test in older fsspec version (need to test windows)

* skip test in windows due to colons in name

* add explicit object-path split with open

* revert use fsspec fork in ci

* use fsspec to expand glob

* skip root from remote_schemas

* test iterate over xrootd

* test

* add temporary install to ci

* remove ci debug

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* feat: set `fsspec` (`s3fs`) as default handler for s3 paths (#1032)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* set version to 5.2.0rc1 (release candidate)

* set s3fs as default for s3

* test different handlers

* correct serialization of fsspec source

* feat: simplify object path split (#1028)

* simplify object path split

* add example from #975

* fix tests

* add more test cases

* test case update

* remove scheme unused regex

* feat: fsspec for all non-object writing - %-encoded urls no longer decoded (#1034)

* writing goes through fsspec

* increase rc version

* type hints and docs

* add helper methods, create

* throw more specific error

* add additional test for `create` failure with scheme other than local

* simplify source selection

* remove windows specific code

* raise exception if invalid combination of handler / input (file-like object and fsspec)

* use softer check for file-like object

* cover problematic case with additional slash (file:///c:/file.root)

* test "file:" scheme (no slash)

* test backslash

* test: add test for issue 1054 (newer fsspec failing to parse files with colons in name) (#1055)

* add test for issue 1054

* additional test

* make sure fsspec fix works

* try new test in older fsspec version (need to test windows)

* skip test in windows due to colons in name

* add explicit object-path split with open

* revert use fsspec fork in ci

* test: improve path object split tests (#1039)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* feat: set `fsspec` (`s3fs`) as default handler for s3 paths (#1032)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* feat: set fsspec as default source (#1023)

* feat: add fsspec as required dependency (#1021)

* fsspec requirements

* simplify fsspec import

* use loop property

* correctly create schemes list

* remove deprecated handlers from docs

* simplify source selection

* return object source

* pickle executor

* rename test

* test more handlers

* option to check writeable file-like object

* rename test

* explicitly set handler

* fix s3 source

* rename test

* Revert "fix s3 source"

This reverts commit e76fdbb.

* sesparate PR for s3 fix (#1024)

* strip file://

* rename test

* rename tests

* add aiohttp skip

* attempt to parse windows paths

* test ci

* Revert "test ci"

This reverts commit 4c1c8a5.

* rename test

* remove fsspec from test

* remove *_handler options

* update defaults

* do not override default s3

* do not use fsspec for multiprocessing

* rename test

* fix not selecting object source

* missing import

* normalize doc

* remove helper

* never return None as source

* remove unnecessary xrootd source default override since fsspec is default now

* rename test

* add empty class to pass old pickle test

* set version to 5.2.0rc1 (release candidate)

* set s3fs as default for s3

* test different handlers

* correct serialization of fsspec source

* feat: simplify object path split (#1028)

* simplify object path split

* add example from #975

* fix tests

* add more test cases

* test case update

* remove scheme unused regex

* feat: fsspec for all non-object writing - %-encoded urls no longer decoded (#1034)

* writing goes through fsspec

* increase rc version

* type hints and docs

* add helper methods, create

* throw more specific error

* add additional test for `create` failure with scheme other than local

* simplify source selection

* remove windows specific code

* raise exception if invalid combination of handler / input (file-like object and fsspec)

* use softer check for file-like object

* cover problematic case with additional slash (file:///c:/file.root)

* test "file:" scheme (no slash)

* test backslash

* add new test case

* split big test in two

* retry on socket error

* xrootd iterator

* iterate over different files

* iterate over tree

* pytest fixture for test directory

* pytest fixture for test directory

* add annotation to open argument

* remove repeated test

* test: add test for issue 1054 (newer fsspec failing to parse files with colons in name) (#1055)

* add test for issue 1054

* additional test

* make sure fsspec fix works

* try new test in older fsspec version (need to test windows)

* skip test in windows due to colons in name

* add explicit object-path split with open

* revert use fsspec fork in ci

* try to expand all glob strings if they have the protocol

* making it work on windows

* testing globbing for s3

* add failing test for http globbing

* test more handlers, failing test for xrootd (missing files)

* understanding error

* add class method to extract fsspec options

* call super constructor for fsspec source

* pass options to regularize files util

* python 3.12 aiohttp test in other PR

* attempt to hide the ssl destructor error

* retry on "expired"

* style: pre-commit fixes

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
2 people authored and jpivarski committed Dec 13, 2023
1 parent af0b55a commit e64da31
Show file tree
Hide file tree
Showing 6 changed files with 137 additions and 30 deletions.
2 changes: 1 addition & 1 deletion src/uproot/_dask.py
Original file line number Diff line number Diff line change
Expand Up @@ -161,7 +161,7 @@ def dask(
array from ``TTrees``.
"""

files = uproot._util.regularize_files(files, steps_allowed=True)
files = uproot._util.regularize_files(files, steps_allowed=True, **options)

is_3arg = [len(x) == 3 for x in files]
if any(is_3arg):
Expand Down
36 changes: 23 additions & 13 deletions src/uproot/_util.py
Original file line number Diff line number Diff line change
Expand Up @@ -287,11 +287,6 @@ def regularize_path(path):
return path


# These schemes may not appear in fsspec if the corresponding libraries are not installed (e.g. s3fs)
_remote_schemes = ["root", "s3", "http", "https"]
_schemes = list({*_remote_schemes, *fsspec.available_protocols()})


def file_object_path_split(urlpath: str) -> tuple[str, str | None]:
"""
Split a path with a colon into a file path and an object-in-file path.
Expand Down Expand Up @@ -815,7 +810,9 @@ def regularize_steps(steps):
return out.tolist()


def _regularize_files_inner(files, parse_colon, counter, HasBranches, steps_allowed):
def _regularize_files_inner(
files, parse_colon, counter, HasBranches, steps_allowed, **options
):
files2 = regularize_path(files)

maybe_steps = None
Expand All @@ -830,12 +827,24 @@ def _regularize_files_inner(files, parse_colon, counter, HasBranches, steps_allo
else:
file_path, object_path = files, None

# This parses the windows drive letter as a scheme!
parsed_url = urlparse(file_path)

if parsed_url.scheme.lower() in _remote_schemes:
yield file_path, object_path, maybe_steps

scheme = parsed_url.scheme
if "://" in file_path and scheme not in ("file", "local"):
# user specified a protocol, so we use fsspec to expand the glob and return the full paths
file_names_full = [
file.full_name
for file in fsspec.open_files(
files,
**uproot.source.fsspec.FSSpecSource.extract_fsspec_options(options),
)
]
# https://github.com/fsspec/filesystem_spec/issues/1459
# Not all protocols return the full_name attribute correctly (if they have url parameters)
for file_name_full in file_names_full:
yield file_name_full, object_path, maybe_steps
else:
# no protocol, default to local file system
expanded = os.path.expanduser(file_path)
if _regularize_files_isglob.search(expanded) is None:
yield file_path, object_path, maybe_steps
Expand Down Expand Up @@ -885,14 +894,15 @@ def _regularize_files_inner(files, parse_colon, counter, HasBranches, steps_allo
counter,
HasBranches,
steps_allowed,
**options,
):
yield file_path, object_path, maybe_steps

elif isinstance(files, Iterable):
for file in files:
counter[0] += 1
for file_path, object_path, maybe_steps in _regularize_files_inner(
file, parse_colon, counter, HasBranches, steps_allowed
file, parse_colon, counter, HasBranches, steps_allowed, **options
):
yield file_path, object_path, maybe_steps

Expand All @@ -905,7 +915,7 @@ def _regularize_files_inner(files, parse_colon, counter, HasBranches, steps_allo
)


def regularize_files(files, steps_allowed):
def regularize_files(files, steps_allowed, **options):
"""
Common code for regularizing the possible file inputs accepted by uproot so they can be used by uproot internal functions.
"""
Expand All @@ -915,7 +925,7 @@ def regularize_files(files, steps_allowed):
seen = set()
counter = [0]
for file_path, object_path, maybe_steps in _regularize_files_inner(
files, True, counter, HasBranches, steps_allowed
files, True, counter, HasBranches, steps_allowed, **options
):
if isinstance(file_path, str):
key = (counter[0], file_path, object_path)
Expand Down
4 changes: 2 additions & 2 deletions src/uproot/behaviors/TBranch.py
Original file line number Diff line number Diff line change
Expand Up @@ -174,7 +174,7 @@ def iterate(
array from ``TTrees``.
* :doc:`uproot._dask.dask`: returns an unevaluated Dask array from ``TTrees``.
"""
files = uproot._util.regularize_files(files, steps_allowed=False)
files = uproot._util.regularize_files(files, steps_allowed=False, **options)
decompression_executor, interpretation_executor = _regularize_executors(
decompression_executor, interpretation_executor, None
)
Expand Down Expand Up @@ -340,7 +340,7 @@ def concatenate(
single concatenated array from ``TTrees``.
* :doc:`uproot._dask.dask`: returns an unevaluated Dask array from ``TTrees``.
"""
files = uproot._util.regularize_files(files, steps_allowed=False)
files = uproot._util.regularize_files(files, steps_allowed=False, **options)
decompression_executor, interpretation_executor = _regularize_executors(
decompression_executor, interpretation_executor, None
)
Expand Down
1 change: 0 additions & 1 deletion src/uproot/models/TTree.py
Original file line number Diff line number Diff line change
Expand Up @@ -906,7 +906,6 @@ def read_members(self, chunk, cursor, context, file):
uproot.classes["TTree"] = Model_TTree
uproot.classes["ROOT::TIOFeatures"] = Model_ROOT_3a3a_TIOFeatures


fEntriesStruct = struct.Struct(">q")


Expand Down
25 changes: 12 additions & 13 deletions src/uproot/source/fsspec.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,30 +27,29 @@ class FSSpecSource(uproot.source.chunk.Source):
"""

def __init__(self, file_path: str, **options):
options = dict(uproot.reading.open.defaults, **options)
storage_options = {
k: v
for k, v in options.items()
if k not in uproot.reading.open.defaults.keys()
}

self._fs, self._file_path = fsspec.core.url_to_fs(file_path, **storage_options)
super().__init__()
self._fs, self._file_path = fsspec.core.url_to_fs(
file_path, **self.extract_fsspec_options(options)
)

# What should we do when there is a chain of filesystems?
self._async_impl = self._fs.async_impl

self._executor = None
self._file = None
self._fh = None

self._num_requests = 0
self._num_requested_chunks = 0
self._num_requested_bytes = 0

self._open()

self.__enter__()

@classmethod
def extract_fsspec_options(cls, options: dict) -> dict:
uproot_default_options = dict(uproot.reading.open.defaults)
options = dict(uproot_default_options, **options)
return {
k: v for k, v in options.items() if k not in uproot_default_options.keys()
}

def _open(self):
self._executor = FSSpecLoopExecutor()
self._file = self._fs.open(self._file_path)
Expand Down
99 changes: 99 additions & 0 deletions tests/test_0692_fsspec_reading.py
Original file line number Diff line number Diff line change
Expand Up @@ -388,3 +388,102 @@ def test_issue_1035(handler):
branch = tree["MuonSpectrometerTrackParticlesAuxDyn.truthParticleLink"]
data = branch.array()
assert len(data) == 40


@pytest.mark.network
@pytest.mark.xrootd
@pytest.mark.parametrize(
"handler",
[
uproot.source.fsspec.FSSpecSource,
None,
],
)
def test_fsspec_globbing_xrootd(handler):
pytest.importorskip("XRootD")
pytest.importorskip("fsspec_xrootd")
iterator = uproot.iterate(
{
"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/Run2012B_*.root": "Events"
},
["PV_x"],
handler=handler,
)

arrays = [array for array in iterator]
# if more files are added that match the glob, this test needs to be updated
assert len(arrays) == 2


@pytest.mark.network
@pytest.mark.xrootd
@pytest.mark.parametrize(
"handler",
[
uproot.source.fsspec.FSSpecSource,
None,
],
)
def test_fsspec_globbing_xrootd_no_files(handler):
pytest.importorskip("XRootD")
pytest.importorskip("fsspec_xrootd")
iterator = uproot.iterate(
{
"root://eospublic.cern.ch//eos/root-eos/cms_opendata_2012_nanoaod/*/ThisFileShouldNotExist.root": "Events"
},
["PV_x"],
handler=handler,
)
with pytest.raises(FileNotFoundError):
arrays = [array for array in iterator]


@pytest.mark.parametrize(
"handler",
[
uproot.source.fsspec.FSSpecSource,
None,
],
)
def test_fsspec_globbing_s3(handler):
pytest.importorskip("s3fs")
if sys.version_info < (3, 11):
pytest.skip(
"https://github.com/scikit-hep/uproot5/pull/1012",
)

iterator = uproot.iterate(
{"s3://pivarski-princeton/pythia_ppZee_run17emb.*.root": "PicoDst"},
["Event/Event.mEventId"],
anon=True,
handler=handler,
)

# if more files are added that match the glob, this test needs to be updated
arrays = [array for array in iterator]
assert len(arrays) == 1
for array in arrays:
assert len(array) == 8004


@pytest.mark.parametrize(
"handler",
[
uproot.source.fsspec.FSSpecSource,
None,
],
)
def test_fsspec_globbing_http(handler):
pytest.importorskip("aiohttp")

# Globbing does not work with http filesystems and will return an empty list of files
# We leave this test here to be notified when this feature is added
iterator = uproot.iterate(
{
"https://github.com/scikit-hep/scikit-hep-testdata/raw/main/src/skhep_testdata/data/uproot-issue*.root": "Events"
},
["MET_pt"],
handler=handler,
)
with pytest.raises(FileNotFoundError):
arrays = [array for array in iterator]

0 comments on commit e64da31

Please sign in to comment.