Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add API to support tracking progress while reading from a file #1759

Merged
merged 28 commits into from
Mar 30, 2022
Merged

Add API to support tracking progress while reading from a file #1759

merged 28 commits into from
Mar 30, 2022

Conversation

althonos
Copy link
Contributor

@althonos althonos commented Dec 15, 2021

Type of changes

  • Bug fix
  • New feature
  • Documentation / docstrings
  • Tests
  • Other

Checklist

  • I've run the latest black with default args on new code.
  • I've updated CHANGELOG.md and CONTRIBUTORS.md where appropriate.
  • I've added tests for new code.
  • I accept that @willmcgugan may be pedantic in the code review.

Description

Hi Will! I've started using rich more and more, so far it's a great experience 😉

One of my common use case for the rich.progress module is tracking the progress made while reading from a file. In bioinformatics it's not rare to process gigabytes of files so I like to have an idea how long of a break I can take while processing my input data!

image

Manually updating the progress bar works for certain things, but for instance, when I want to pass a file-like handle to an API that consumes it (e.g json.load), there's no way to update the progress bar between each chunk. To do so, the only way I've found was to wrap the file-like object itself so that it updates the progress bar whenever the file is read from. This is something you could partially do with tqdm.tqdm.wrapattr.

Since the rich.progress API has no straightforward support for this (like it does with track for an iterator), I made my own snippet for this, which ends up working well. I've been copying around in every project that uses rich, so I think it would make a good addition to the library!

Changes

This PR adds the rich.progress._Reader class, which wraps a file-like object, and updates a Progress instance while being read from. The class itself is private; to get a reader, you have to use the new Progress.read method, which takes either a path or a file-like object, and returns a file reader with progress support.

This lets you do something like this (e.g. loading JSON-serialized data from a file, using os.stat to get the total number of bytes to be read):

import json
from rich.progress import *

with Progress(columns=[BarColumn(), DownloadColumn()]) as progress:
    with progress.read("data.json") as f:
        data = json.load(f)

You can also directly pass a file-like object to progress.read, in which case you must specify the total:

import json
from rich.progress import *

with Progress(columns=[BarColumn(), DownloadColumn()]) as progress:
    with open("data.json", "rb") as f:
        data = json.load(progress.read(f, total=2048))

In addition, I added a wrapper function rich.progress.read like rich.progress.track which handles setting up a progress, so that you can get a progress reader in just two lines:

with rich.progress.read("data.json") as f:
    data = json.load(f)

Notes

  • In rich.progress.read and Progress.read, I made total an int instead of a float, because it should be a number of bytes so a float would not make sense here.
  • If you seek the _Reader, it will also reset the position of the progress bar to the new position.

@althonos althonos marked this pull request as ready for review December 16, 2021 14:11
@willmcgugan
Copy link
Collaborator

Hi Martin! For a moment there I thought I was on a different repos.

This looks like a great addition. Would you mind updating the docs for this? A single sub-section under progress should do it.

Thanks.

@willmcgugan willmcgugan self-assigned this Jan 5, 2022
@althonos
Copy link
Contributor Author

Rebased against latest master.

@althonos
Copy link
Contributor Author

Hi @willmcgugan , any update on this? 😄

@codecov-commenter
Copy link

codecov-commenter commented Mar 13, 2022

Codecov Report

Merging #1759 (5717d8d) into master (7105781) will decrease coverage by 0.39%.
The diff coverage is 85.44%.

@@            Coverage Diff             @@
##           master    #1759      +/-   ##
==========================================
- Coverage   99.46%   99.07%   -0.40%     
==========================================
  Files          72       72              
  Lines        7339     7551     +212     
==========================================
+ Hits         7300     7481     +181     
- Misses         39       70      +31     
Flag Coverage Δ
unittests 99.07% <85.44%> (-0.40%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
rich/__main__.py 100.00% <ø> (ø)
rich/progress.py 93.51% <77.69%> (-4.92%) ⬇️
rich/console.py 99.21% <100.00%> (+0.06%) ⬆️
rich/terminal_theme.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f43ccc...5717d8d. Read the comment docs.

@willmcgugan
Copy link
Collaborator

Hi Martin,

Sorry about the delay! It's been on the backburner for a while.

Love the idea, but I think we can refine the API a bit to make it easier to grok.

Am I right in saying it is overloaded to either open and return a file / context manager and also to read from an open file?

I suspect that will be tricky to explain to users. Could we separate those so we have rich.progress.open and a rich.progress.wrap_file method (plus equivalent on Progress). I'd also like to support text-files without additional wrapping, as I'm sure that will be a common use case.

What do you think?

@althonos
Copy link
Contributor Author

No worries, I also have things that have been lying around for so long and I don't always get myself up to speed, so I can definitely understand.

At the moment the code will either wrap a file through a context-manager that does need to be closed, or wrap a file-like object in a context-manager that does not need to be closed, and i can understand this may be confusing. I can totally split this into two functions.

@althonos
Copy link
Contributor Author

Okay, I've split the code into two methods:

  • Progress.open, which takes a path and returns a file, it accepts a mode argument to open in text or binary mode, as well as an encoding argument like the builtin open method.
  • Progress.wrap_file which wraps a file-like object.

I also allowed skipping specifiying the total provided that a task_id is given which refers to a task that already had a total set; this way you can do both:

with Progress() as progress:
    # specify a total in a task beforehand
    task_id = progress.add_task("Reading...", total=100)
    with progress.wrap_file(file_handle, task_id=task_id) as reader:
        pass
    # specify a total directly
    with progress.wrap_file(file_handle, total=100) as reader:
        pass

@althonos
Copy link
Contributor Author

Okay, I've updated rich.progress.open and Progress.open to have a signature similar to open (supporting the buffering, newline, encoding and errors arguments).

In addition, I've fixed the type annotations, and added overloaded annotations to determine the return type of rich.progress.open and Progress.open based on the mode (TextIO if "r" or "rt", BinaryIO if "rb").

@willmcgugan
Copy link
Collaborator

@darrenburns Could you give this one a look through. It's a feature to wrap a file so that reading updates a progress bar.

Copy link
Member

@darrenburns darrenburns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a few small comments, really liking it overall though. Nice idea 😎

Reading from a file
~~~~~~~~~~~~~~~~~~~

You can obtain a progress-tracking reader using the :meth:`~rich.progress.Progress.open` method by giving it a path. You can specify the number of bytes to be read, but by default :meth:`~rich.progress.Progress.open` will query the size of the file with :func:`os.stat`. You are responsible for closing the file, and you should consider using a *context* to make sure it is closed ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the Progress.open API 😎

rich/progress.py Outdated Show resolved Hide resolved
rich/progress.py Outdated Show resolved Hide resolved
rich/progress.py Outdated Show resolved Hide resolved
@althonos
Copy link
Contributor Author

Thanks @darrenburns ! Idea is actually coming from my Rust experience with indicatif, where you can a progress bar for reading a file with a single function call 😄

@darrenburns
Copy link
Member

Thanks for making those changes @althonos! Could you run make format and make typecheck to ensure CI passes?

@althonos
Copy link
Contributor Author

Okay this should be it! I tried running make typecheck but i get an error in rich/_win32_console.py, not in rich.progress, I don't know if that's caused by my PR or not.

Copy link
Member

@darrenburns darrenburns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments for your consideration 😄

rich/progress.py Show resolved Hide resolved
rich/progress.py Show resolved Hide resolved
docs/source/progress.rst Outdated Show resolved Hide resolved
rich/progress.py Outdated
Comment on lines 231 to 241
def readall(self) -> bytes:
block = self.handle.readall() # type: ignore
self.progress.advance(self.task, advance=len(block))
return block # type: ignore

def readinto(self, b: Union[bytearray, memoryview, mmap]): # type: ignore
n = self.handle.readinto(b) # type: ignore
self.progress.advance(self.task, advance=n)
return n

def readline(self, size: int = -1) -> bytes: # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the thinking behind these type: ignore comments?

As far as I can see,self.handle which is typed as a BinaryIO doesn't contain readinto or readall (those come from RawIOBase). mypys warnings seem valid to me here.

Indeed, calling readall on the _Reader instance will result in AttributeError: '_io.BufferedReader' object has no attribute 'readall'. Did you mean: 'readable'?

I'm wondering if these methods are required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as far as I've seen while doing I/O related typing in Pyfilesystem2, the I/O typing is really inconsistent between the typing and the io modules.

Basically, when you open a file in binary mode with open, you get either an io.BufferedReader or an io.FileIO (if buffering is disabled), both of which have a readinto method, . However when mypy cannot determine the buffering statically, it will have open default to returning a BinaryIO, which doesn't have these methods, although the actual objects being returned will always have readinto:

https://github.com/python/typeshed/blob/9687d53b65d055a990f0187b7baa1137cb7c717b/stdlib/builtins.pyi#L1331-L1341

That's why I decided to silence mypy here. I can definitely remove the readall implementation, but I'd leave the readinto, in particular because it can really improve performance in some cases by avoiding useless copies of byte data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will defer to your experience here, Martin! Regarding the readall, it is probably better to be defensive here even if it defies the typing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's actually safer here to remove readall because I'll expect the majority of calls to progress.open to be buffered, so the returned io.BufferedReader won't have a readall method. It's also much more marginal than readinto in my experience, so I don't think you'll get a lot complaints that _Reader misses a readall method because there's very limited reason to use it instead of read() without arguments.

@willmcgugan
Copy link
Collaborator

@althonos We added a Mypy switch recently that require the error code in any type: ignore, which explains the CI fail.

@willmcgugan
Copy link
Collaborator

Thanks, @althonos ! This is a fantastic feature. Will do a release v soon.

@willmcgugan willmcgugan merged commit 3f61975 into Textualize:master Mar 30, 2022
@althonos
Copy link
Contributor Author

Don't mention it! As usual, thanks a lot both of you for the review, and very excited to use Progress.open as soon as it's out there 😉

@astrochun
Copy link

Still new to this API, but following some of the guides on RTDs, it looks like from rich.progress import read is no longer supported (use open instead). Is that right?

See: https://rich.readthedocs.io/en/latest/progress.html#basic-usage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants