docs: fsspec documentation #1074

Open
wants to merge 5 commits into
base: main
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -2,7 +2,7 @@

[![PyPI version](https://badge.fury.io/py/uproot.svg)](https://pypi.org/project/uproot)
[![Conda-Forge](https://img.shields.io/conda/vn/conda-forge/uproot)](https://github.com/conda-forge/uproot-feedstock)
[![Python 3.7‒3.11](https://img.shields.io/badge/python-3.7%E2%80%923.11-blue)](https://www.python.org)
[![Python 3.8‒3.12](https://img.shields.io/badge/python-3.8%E2%80%923.12-blue)](https://www.python.org)
[![BSD-3 Clause License](https://img.shields.io/badge/license-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
[![Continuous integration tests](https://github.com/scikit-hep/uproot5/actions/workflows/build-test.yml/badge.svg)](https://github.com/scikit-hep/uproot5/actions)

89 changes: 89 additions & 0 deletions docs-sphinx/basic.rst
Expand Up @@ -1251,3 +1251,92 @@ In addition, each TBranch of the TTree can have a different compression setting:
{'x': None, 'ny': None, 'y': ZLIB(4)}

Changes to the compression setting only affect TBaskets written after the change (with :ref:`uproot.writing.writable.WritableTree.extend`; see above).

Using fsspec for reading and writing files
Review comment (Member):

Since this documentation should be user-oriented, we might even go so far as to not call this section "Using fsspec," but "File access through remote protocols." In the text, you can say that we use fsspec to do it, and therefore direct users to fsspec if they need details. But the section header should be useful for people who want to read or write remote files and don't know how, or don't even know if Uproot can do it.

------------------------------------------

Since version `5.2.0 <https://github.com/scikit-hep/uproot5/releases/tag/v5.2.0>`_, Uproot supports reading and writing files using `fsspec <https://filesystem-spec.readthedocs.io/en/latest/>`_.
This allows you to read and write files from a variety of sources, including cloud storage and HTTP servers.
Comment on lines +1258 to +1259
Review comment (Member):

Not really needed—readers of documentation ought to presume that it describes the most recent version of the code. (/latest/ is in the URL.)


fsspec has been the default source since 5.2.0, but a source can be specified manually by passing a ``uproot.source.chunk.Source`` class as the ``handler`` argument of Uproot functions such as ``uproot.open``, ``uproot.iterate``, and ``uproot.concatenate``.

In general, there is no need to worry about the source: Uproot automatically chooses a suitable one for the given path.
Comment on lines +1261 to +1263
Review comment (Member):

If users don't need to worry about it, then it doesn't make sense to bring it up here. In the diátaxis taxonomy, this basic.rst page is in the tutorials/learning corner:

So anything detailed can be relegated to the docstrings (the information/reference corner).


In some cases, specifying the source manually may improve performance. For example, when opening a file from a local path, ``handler=uproot.source.file.MemmapSource`` (instead of the default ``handler=uproot.source.fsspec.FSSpecSource``) may reduce the time to open the file at the cost of using more memory.
Review comment (Member):

This sort of thing would be better to address in GitHub Discussions, after it comes up. These are the sorts of concerns that only matter at scale, and some users worry about all of these things because they've read about it, but their file is only a few MB in size.


Any fsspec protocol should work for reading, but only protocols that support writing will work for writing.

fsspec is a dependency of Uproot, but some protocols need additional packages.
For example, opening S3 files requires `s3fs <https://github.com/fsspec/s3fs>`_.
When a package needed for a protocol is missing, Uproot raises an exception whose message points to the missing dependency.
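
As a rough illustration of the dependency requirement above, the sketch below checks whether the extra package for a protocol can be imported before a file is opened. The ``EXTRA_PACKAGES`` table and the ``missing_extra`` helper are hypothetical names, not part of Uproot or fsspec, and the table lists only the two pairings mentioned in this section.

.. code-block:: python

    import importlib.util
    from urllib.parse import urlsplit

    # Assumed mapping from URL scheme to the package that provides it.
    # Only the pairings named in this section are listed.
    EXTRA_PACKAGES = {"s3": "s3fs", "ssh": "paramiko"}


    def missing_extra(url):
        """Return the extra package that url needs but that is not installed.

        Returns None for local paths, for schemes not in EXTRA_PACKAGES,
        and when the required package can be imported.
        """
        scheme = urlsplit(url).scheme
        package = EXTRA_PACKAGES.get(scheme)
        if package is None:
            return None
        if importlib.util.find_spec(package) is None:
            return package
        return None


    needed = missing_extra("s3://bucket/file.root")
    if needed is not None:
        print(f"install {needed} to open s3:// paths")

Uproot already reports the missing package itself, so a check like this would only matter when the failure should be detected before any file is opened.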

Some protocols, such as ``s3`` or ``ssh``, may need additional fsspec options, such as credentials. Pass these as keyword arguments to the Uproot function, which forwards them to fsspec.
Comment on lines +1267 to +1273
Review comment (Member):

This is good information, though it should come after a basic "how to" sentence. The first thing readers of this documentation should see is a statement saying that they can use any remote protocol that fsspec (link) knows about by writing a URL schema, such as "https://", "root://", or "s3://".

Second, you'd say that they might be prompted to load additional dependencies.

Third, you'd say that some of these protocols will work for writing; see the fsspec documentation for details.

Fourth, that some of them require additional arguments, and they can be passed as keyword **options.

Review comment (Member):

The comment on writing could also indicate that this is possible, but likely inefficient. The best way to write a file and host it remotely is to write it locally and upload the final result.


Keep in mind that more than one library may implement a given fsspec backend, which can lead to errors when using Uproot. For example, the fsspec ssh tests assume `paramiko <https://github.com/paramiko/paramiko>`_ is installed, but another library such as `sshfs <https://github.com/fsspec/sshfs>`_ might be present instead; it also adds ssh support but might behave differently.
Review comment (Member):

This should be dealt with in GitHub Discussions or Issues, when users run into it.

(I'm assuming that the majority won't, and the tutorial documentation has to be streamlined for the majority.)


Reading
~~~~~~~

Opening a file via S3:

.. code-block:: python

    >>> with uproot.open("s3://pivarski-princeton/pythia_ppZee_run17emb.picoDst.root:PicoDst",
    ...                  anon=True) as f:
    ...     ...

In this case, `s3fs <https://github.com/fsspec/s3fs>`_ requires the ``anon=True`` option to open the file when AWS credentials are not set.

Opening a file via SSH requires `paramiko <https://github.com/paramiko/paramiko>`_ to be installed (technically, any other library that implements the protocol for fsspec, such as `sshfs <https://github.com/fsspec/sshfs>`_, would also work).

Some parameters, such as the SSH user and host, can be passed directly in the URL:

.. code-block:: python

    >>> with uproot.open("ssh://user@host:port/file.root") as f:
    ...     ...
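
To make the pieces of such a URL explicit, the sketch below splits one with the Python standard library. It is only an illustration of which part is the user, host, port, and path; it does not claim to be how fsspec parses the URL internally, and port ``22`` is an assumed value standing in for ``port``.

.. code-block:: python

    from urllib.parse import urlsplit

    # Port 22 is an assumption for illustration; use whatever the server listens on.
    parts = urlsplit("ssh://user@host:22/file.root")

    connection = {
        "username": parts.username,  # text before "@"
        "host": parts.hostname,      # text between "@" and ":"
        "port": parts.port,          # integer after ":"
        "path": parts.path,          # remote file path
    }
    print(connection)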

File globbing
~~~~~~~~~~~~~

Some protocols support glob expressions, which work the same way as on the local filesystem.

Opening multiple files via globbing over XRootD:

.. code-block:: python

    >>> iterator = uproot.iterate("root://host.domain.com/path/to/files/*.root")

Not all protocols that support reading also support globbing. For example, HTTP does not: a glob expression over HTTP returns an empty list of files.

Globbing is a direct consequence of the fsspec integration, so requests for globbing support should be directed to fsspec or to the specific protocol implementation (it may not be technically possible for some protocols).
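
Because an unsupported protocol can yield an empty list rather than an error, it can help to check that a pattern matched something. The sketch below shows the idea on the local filesystem using only the standard library; ``expand_or_fail`` is a hypothetical helper, not an Uproot or fsspec function, and for remote protocols the expansion itself is done by fsspec.

.. code-block:: python

    import glob
    import os
    import tempfile


    def expand_or_fail(pattern):
        """Expand a local glob pattern, raising if nothing matches."""
        paths = sorted(glob.glob(pattern))
        if not paths:
            raise FileNotFoundError(f"no files match {pattern!r}")
        return paths


    # Create a few empty placeholder files to match against.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("a.root", "b.root", "notes.txt"):
            open(os.path.join(tmp, name), "wb").close()
        pattern = os.path.join(tmp, "*.root")
        matched = [os.path.basename(p) for p in expand_or_fail(pattern)]

    print(matched)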

Writing
~~~~~~~

The syntax used for writing files locally with Uproot also works for writing over other protocols via fsspec.
Specify the protocol in the path (``ssh://...``) and pass any necessary options as keyword arguments.
If the protocol does not support writing, a ``NotImplementedError`` is raised.
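
Writing directly to a remote location may be inefficient, so one alternative is to write the file locally and upload the finished result. The sketch below outlines that pattern with placeholder callables; ``write_then_upload``, ``fake_write``, and the local "upload" via ``shutil.copy`` are hypothetical stand-ins so that the example runs without a remote server.

.. code-block:: python

    import os
    import shutil
    import tempfile


    def write_then_upload(write, upload, filename="output.root"):
        """Call write(local_path), then upload(local_path), and return its result."""
        with tempfile.TemporaryDirectory() as tmp:
            local_path = os.path.join(tmp, filename)
            write(local_path)          # e.g. fill the file with uproot.recreate(local_path)
            return upload(local_path)  # e.g. copy it out with an fsspec filesystem's put()


    def fake_write(path):
        # Stand-in for the real writing step: just create a small file.
        with open(path, "wb") as f:
            f.write(b"placeholder")


    # Stand-in for a remote location: another local directory.
    destination = tempfile.mkdtemp()
    uploaded = write_then_upload(fake_write, lambda p: shutil.copy(p, destination))
    print(uploaded)

The upload step happens before the temporary directory is removed, so the local copy only needs to live for the duration of the call.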

Local cache
~~~~~~~~~~~

fsspec supports caching files locally, which is useful for repeated access to the same file. The cache can also be used when writing remote files, so that nothing is written to the remote location until the file is closed. Additional information is available `in the fsspec docs <https://filesystem-spec.readthedocs.io/en/latest/features.html?highlight=simplecache#caching-files-locally>`_.

For example, the following code will download the whole file to a local cache directory:

.. code-block:: python

    >>> with uproot.open("simplecache::http://host:port/file.root") as f:
    ...     ...

This improves read speed at the cost of waiting for the whole file to download and using more disk space.

The following fsspec option can be used to specify the cache directory:

.. code-block:: python

    >>> with uproot.open("simplecache::http://host:port/file.root", simplecache={"cache_storage": cache_path}) as f:
    ...     ...
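
The ``cache_path`` variable above is not defined in the example. A minimal sketch of one way to prepare it, assuming a fresh temporary directory is acceptable as the cache location, is shown below; ``simplecache_options`` is a hypothetical helper that only builds the keyword arguments.

.. code-block:: python

    import os
    import tempfile


    def simplecache_options(cache_path):
        """Build keyword arguments that point simplecache at cache_path."""
        return {"simplecache": {"cache_storage": cache_path}}


    # Assumption: a new temporary directory is used as the cache location.
    cache_path = tempfile.mkdtemp(prefix="uproot-cache-")
    options = simplecache_options(cache_path)

    # The options would then be passed along as keyword arguments, e.g.
    # uproot.open("simplecache::http://host:port/file.root", **options)
    print(options)

A directory created this way is not removed automatically, so it can be reused across sessions or deleted once the cached files are no longer needed.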