-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery: Add tqdm progress bar for downloads #7552
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Signed the CLA! |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll want to get these lines covered in our unit test coverage, though IMO we don't really need to check for actual progress bar updates, maybe just mock it out and check that pbar.update
gets called when tqdm
is installed.
Also, could you add tqdm
to "extras" in setup.py
?
# report progress if tqdm installed | ||
try: | ||
from tqdm import tqdm | ||
pbar = tqdm(desc="Downloading", total=self.total_rows, unit="rows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this do something sensible when total_rows
is not populated? As discussed in #7217 total_rows
is None
until iteration starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I'll update pbar.total
in the loop if it's unset. tqdm
handles this just fine.
|
||
# report progress if tqdm installed | ||
try: | ||
from tqdm import tqdm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aim for 100% unit test coverage, so could you follow the pattern that we do with the optional pandas dependency where we import it at the top of the module and catch import errors there?
That way we can install tqdm for our unit tests but then mock it out in another test to check that it doesn't fail when the tqdm module can't be loaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
Looks great, thanks for the contribution! Just need to make sure we get test coverage and we're good-to-go, I think. |
f1b1fa0
to
b23d581
Compare
I think that covers your comments @tswast, would you mind taking another look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great! Thanks for the tests. Just a few nits, mostly regarding var names and comments.
|
||
if pbar is not None: | ||
pbar.total = pbar.total or self.total_rows | ||
# update progress bar with number of rows in last frame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is unnecessary, since it just states what is happening in the next line. Comments in this codebase should state why or explain something that looks "wrong".
This comment indicates to me that we need to rename pbar
to progress_bar
. Also, we could add some intermediate variables above to make this line more self-explanatory.
frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))
|
turns into
|
V
current_frame = self._to_dataframe_dtypes(page, column_names, dtypes)
frames.append(current_frame)
...
progress_bar.update(len(current_frame))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome feedback, thank you :)
for page in iter(self.pages): | ||
frames.append(self._to_dataframe_dtypes(page, column_names, dtypes)) | ||
|
||
if pbar is not None: | ||
pbar.total = pbar.total or self.total_rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a comment here explaining why we are setting the total (again).
# In some cases, the number of total rows is not populated
# until the first page of rows is fetched. Update the
# progress bar's total to keep an accurate count.
pbar.total = pbar.total or self.total_rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do!
desc="Downloading", total=self.total_rows, unit="rows" | ||
) | ||
except (KeyError, TypeError): | ||
# tqdm error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like this comment to explain a little more why these errors might happen. And more importantly why we are letting them pass (because a broken progress bar shouldn't stop us from downloading results).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking here was to protect from something like an interface change in tqdm
. Indeed the fallback should just be to not show a progress bar.
bigquery/tests/unit/test_table.py
Outdated
row_iterator = RowIterator(_mock_client(), api_request, path, schema) | ||
df = row_iterator.to_dataframe() | ||
|
||
self.assertFalse(len(df) == 0) # all should be well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we control the API results, let's be a little more precise and use self.assertEqual(len(df), 4)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, done!
Looks great. Thanks so much for the contribution @JohnPaton . I'll kick off the tests and merge once they pass. |
This is cool! @tswast do you know how this will affect the notebook display? |
Some discussion is here: tqdm/tqdm#443 |
Sorry hit "close" instead of "comment" there |
@alixhami Here's some GIFs.
I think for all but magics, it's beautiful. Magics is a little gross, but still an improvement compared to having it hang forever while downloading big results. |
Thinking about the notebook implications of this and the in-progress project of creating a notebook testing tool that will try to run notebooks without error, would it be possible to avoid using stderr? Also, it looks like this will always show the progress bar if the user has tqdm installed, but would it make sense to have the progress bar as an optional parameter on the |
I've just found this update which I didn't previously know about: tqdm/tqdm@97a9393#diff-bf59de82e6ce121b5213bacf25304eb2 Looks like What do you think @tswast, worth an update? I think it looks pretty good actually. Difficult to test though 🙈 |
The fact that the Edit: No reason for Edit 2: Probably Edit 3: Let's default to Ignore below The
becomes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last thing, sorry for all the edit noise.
Oh, and there are some lint and coverage problems in the CI build. I can handle those if you'd prefer. I know it can be a bit of a pain to set up nox / our linter tool and everything.
|
||
# report progress if tqdm installed | ||
progress_bar = None | ||
if tqdm is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a progress_bar_type
(string) parameter to to_dataframe
and check that it also is not None.
It should default to None
for now, until we update the magics module to support turning it off in the Context
and also update the magics to use the tqdm_notebook
option instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should default to None for now.
Actually, I think defaulting to 'tqdm'
is fine. When @alixhami tests notebooks, he can set google.cloud.bigquery.table.tqdm = None
like you do in your tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should default to None
because otherwise it will throw errors for users who don't have tqdm
installed who aren't using the parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in the latest commits.
Okay, so to summarize, we'll add an optional parameter I indeed don't have the linter set up, nox was giving me errors so I've just been |
Good summary. Yes, I can handle the lint and coverage issues. |
Thanks a bunch for the contribution @JohnPaton. I just merged this PR, it should be available in the next release of the |
Awesome! Happy to help :) |
As discussed here with @tswast, this PR adds a
tqdm
progress bar for monitoring table downloads. The progress bar is updated with the downloaded number of rows after eachpage
has loaded, and informs the user how many rows in total are to be downloaded.This PR does not add
tqdm
as a dependency, it skips the progress bar iftqdm
is not installed, or if any of thetqdm
errors are raised during progress bar construction.