BigQuery: Add tqdm progress bar for downloads #7552

Merged Mar 28, 2019 (12 commits)

JohnPaton
Contributor

As discussed here with @tswast, this PR adds a tqdm progress bar for monitoring table downloads. The progress bar is updated with the downloaded number of rows after each page has loaded, and informs the user how many rows in total are to be downloaded.

This PR does not add tqdm as a dependency. It skips the progress bar if tqdm is not installed, or if any tqdm error is raised during progress bar construction.
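For readers who want a concrete picture, here is a minimal sketch of the behaviour described above. It is not the code from this PR: `download_rows` and its arguments are hypothetical names, and pages are modelled as plain lists. It shows the optional import and a per-page `update` call; handling of construction errors is left out of this sketch.

```python
try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None  # no progress bar will be shown


def download_rows(pages, total_rows=None):
    """Collect rows from an iterable of pages, reporting progress if tqdm exists.

    Hypothetical helper for illustration only.
    """
    progress_bar = None
    if tqdm is not None:
        progress_bar = tqdm(desc="Downloading", total=total_rows, unit="rows")

    rows = []
    for page in pages:
        rows.extend(page)
        if progress_bar is not None:
            # Advance the bar by the number of rows in the page just loaded.
            progress_bar.update(len(page))

    if progress_bar is not None:
        progress_bar.close()
    return rows


result = download_rows([[1, 2], [3]], total_rows=3)
```

The function returns the same rows whether or not tqdm is importable, which is the property the PR description is after.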

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no This human has *not* signed the Contributor License Agreement. label Mar 23, 2019
@JohnPaton
Contributor Author

Signed the CLA!

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Mar 23, 2019
@JohnPaton JohnPaton changed the title Add tqdm progress bar for BigQuery downloads BigQuery: Add tqdm progress bar for downloads Mar 23, 2019
Contributor

@tswast tswast left a comment

We'll want to get these lines covered in our unit test coverage, though IMO we don't really need to check for actual progress bar updates, maybe just mock it out and check that pbar.update gets called when tqdm is installed.

Also, could you add tqdm to "extras" in setup.py?
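The mocking approach suggested in this review (check that `update` gets called, without inspecting real progress bar output) might look roughly like the sketch below. `to_frames` and `progress_bar_factory` are hypothetical stand-ins for the real download loop, and pages are plain lists rather than API pages.

```python
from unittest import mock


def to_frames(pages, progress_bar_factory=None):
    """Hypothetical loop: build one 'frame' per page and report progress."""
    progress_bar = progress_bar_factory() if progress_bar_factory else None
    frames = []
    for page in pages:
        current_frame = list(page)
        frames.append(current_frame)
        if progress_bar is not None:
            progress_bar.update(len(current_frame))
    return frames


# A MagicMock plays the role of the tqdm constructor; its return value is the
# "progress bar" whose update() calls are recorded.
fake_tqdm = mock.MagicMock(name="tqdm")
frames = to_frames([[1, 2, 3], [4]], progress_bar_factory=fake_tqdm)
update_args = [call.args[0] for call in fake_tqdm.return_value.update.call_args_list]
```

Asserting on the recorded arguments keeps the test independent of how tqdm renders to the terminal.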

# report progress if tqdm installed
try:
    from tqdm import tqdm
    pbar = tqdm(desc="Downloading", total=self.total_rows, unit="rows")
Contributor

Does this do something sensible when total_rows is not populated? As discussed in #7217 total_rows is None until iteration starts.

Contributor Author

@JohnPaton JohnPaton Mar 25, 2019

Good catch, I'll update pbar.total in the loop if it's unset. tqdm handles this just fine.
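A small sketch of the "update the total inside the loop" idea follows. `FakeRowIterator` and `RecordingBar` are hypothetical stand-ins invented for this example: the first mimics an iterator whose `total_rows` is `None` until the first page arrives, and the second exposes only the `total` attribute and `update()` method that the loop relies on.

```python
class FakeRowIterator:
    """Hypothetical iterator: total_rows is unknown until iteration starts."""

    def __init__(self, pages):
        self._pages = pages
        self.total_rows = None

    @property
    def pages(self):
        for page in self._pages:
            # Pretend the first response carries the total row count.
            self.total_rows = sum(len(p) for p in self._pages)
            yield page


class RecordingBar:
    """Minimal object with the two members used below: total and update()."""

    def __init__(self, total=None):
        self.total = total
        self.n = 0

    def update(self, count):
        self.n += count


def consume(iterator, progress_bar):
    for page in iterator.pages:
        # The total may still be None if it was unset at construction time.
        progress_bar.total = progress_bar.total or iterator.total_rows
        progress_bar.update(len(page))
    return progress_bar


bar = consume(FakeRowIterator([[1, 2], [3, 4]]), RecordingBar(total=None))
```

Because of the `or`, a total that was already known at construction time is left untouched.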


# report progress if tqdm installed
try:
    from tqdm import tqdm
Contributor

We aim for 100% unit test coverage, so could you follow the pattern that we do with the optional pandas dependency where we import it at the top of the module and catch import errors there?

That way we can install tqdm for our unit tests but then mock it out in another test to check that it doesn't fail when the tqdm module can't be loaded.
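The pattern described here, a module-level optional import plus a test that simulates the module being absent, can be sketched as follows. The function names are hypothetical, and `mock.patch.dict(globals(), ...)` is used only so that the example fits in a single file; a real test suite would more likely patch the attribute on the module under test.

```python
from unittest import mock

try:
    import tqdm  # optional dependency, imported once at module level
except ImportError:
    tqdm = None


def make_progress_bar(total):
    """Return a tqdm progress bar, or None when tqdm is unavailable."""
    if tqdm is None:
        return None
    return tqdm.tqdm(desc="Downloading", total=total, unit="rows")


def progress_bar_without_tqdm(total):
    """Call make_progress_bar while pretending tqdm could not be imported."""
    with mock.patch.dict(globals(), {"tqdm": None}):
        return make_progress_bar(total)
```

Since the import happens at the top of the module, both branches (tqdm present, tqdm missing) can be exercised in one test run regardless of what is installed.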

Contributor Author

Done!

@tswast
Contributor

tswast commented Mar 23, 2019

Looks great, thanks for the contribution! Just need to make sure we get test coverage and we're good-to-go, I think.

@JohnPaton
Contributor Author

I think that covers your comments @tswast, would you mind taking another look?

@tseaver tseaver added api: bigquery Issues related to the BigQuery API. kokoro:force-run Add this label to force Kokoro to re-run the tests. labels Mar 27, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 27, 2019
Contributor

@tswast tswast left a comment

Looking great! Thanks for the tests. Just a few nits, mostly regarding var names and comments.


if pbar is not None:
    pbar.total = pbar.total or self.total_rows
    # update progress bar with number of rows in last frame
Contributor

This comment is unnecessary, since it just states what is happening in the next line. Comments in this codebase should state why or explain something that looks "wrong".

This comment indicates to me that we need to rename pbar to progress_bar. Also, we could add some intermediate variables above to make this line more self-explanatory.

frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))

     |
turns into
     |
     V

current_frame = self._to_dataframe_dtypes(page, column_names, dtypes)
frames.append(current_frame)
...
    progress_bar.update(len(current_frame))

Contributor Author

Awesome feedback, thank you :)

for page in iter(self.pages):
    frames.append(self._to_dataframe_dtypes(page, column_names, dtypes))

    if pbar is not None:
        pbar.total = pbar.total or self.total_rows
Contributor

We should add a comment here explaining why we are setting the total (again).

                # In some cases, the number of total rows is not populated
                # until the first page of rows is fetched. Update the
                # progress bar's total to keep an accurate count.
                pbar.total = pbar.total or self.total_rows

Contributor Author

Will do!

        desc="Downloading", total=self.total_rows, unit="rows"
    )
except (KeyError, TypeError):
    # tqdm error
Contributor

I'd like this comment to explain a little more why these errors might happen. And more importantly why we are letting them pass (because a broken progress bar shouldn't stop us from downloading results).

Contributor Author

My thinking here was to protect from something like an interface change in tqdm. Indeed the fallback should just be to not show a progress bar.
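To illustrate the fallback being discussed, the sketch below wraps progress bar construction so that a `KeyError` or `TypeError` (for example, from a renamed keyword argument) results in a warning and no bar, rather than a failed download. `safe_progress_bar` and `changed_interface` are hypothetical names; the latter simulates an interface change instead of using tqdm itself.

```python
import warnings


def safe_progress_bar(constructor, **kwargs):
    """Build a progress bar, or return None if construction fails."""
    try:
        return constructor(**kwargs)
    except (KeyError, TypeError) as exc:
        # For example, the library renamed or removed a keyword argument.
        # The download matters more than the bar, so warn and carry on.
        warnings.warn("Progress bar disabled: {!r}".format(exc), UserWarning)
        return None


def changed_interface(description):
    """Hypothetical constructor whose keyword names no longer match ours."""
    return object()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bar = safe_progress_bar(
        changed_interface, desc="Downloading", total=4, unit="rows"
    )
```

Callers then only need the usual `if bar is not None` check before calling `update`.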

row_iterator = RowIterator(_mock_client(), api_request, path, schema)
df = row_iterator.to_dataframe()

self.assertFalse(len(df) == 0) # all should be well
Contributor

Since we control the API results, let's be a little more precise and use self.assertEqual(len(df), 4).

Contributor Author

Good call, done!

@tswast tswast added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 27, 2019
@tswast
Contributor

tswast commented Mar 27, 2019

Looks great. Thanks so much for the contribution @JohnPaton . I'll kick off the tests and merge once they pass.

@tswast tswast self-assigned this Mar 27, 2019
@alixhami
Contributor

This is cool! @tswast do you know how this will affect the notebook display?

@JohnPaton
Contributor Author

JohnPaton commented Mar 27, 2019

tqdm is a bit unpredictable in notebooks, unfortunately. Ideally the progress bar just updates in stderr under the cell, but if you run the same cell several times it sometimes has issues with the carriage return, and each update ends up printing on a new line. tqdm does have a special function to render the progress bar nicely in notebooks, but using it would require somehow checking the context to see whether we're in a notebook.

Some discussion is here: tqdm/tqdm#443

@JohnPaton JohnPaton closed this Mar 27, 2019
@JohnPaton JohnPaton reopened this Mar 27, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 27, 2019
@JohnPaton
Contributor Author

Sorry, I hit "close" instead of "comment" there.

@tswast
Contributor

tswast commented Mar 27, 2019

@alixhami Here's some GIFs.

(1) running a Python script [GIF: script]

(2) running in interactive Python [GIF: interactive-python]

(3) notebook with the function called directly [GIF: notebook]

(4) notebook with magics [GIF: notebook-magics]

I think for all but magics, it's beautiful. Magics is a little gross, but still an improvement compared to having it hang forever while downloading big results.

@alixhami
Contributor

Thinking about the notebook implications of this and the in-progress project of creating a notebook testing tool that will try to run notebooks without error, would it be possible to avoid using stderr?

Also, it looks like this will always show the progress bar if the user has tqdm installed, but would it make sense to have the progress bar as an optional parameter on the to_dataframe() function? That way it could be used by pandas-gbq, but we could opt out of it for magics. It would also allow notebook users to avoid it if it's buggy there.

@JohnPaton
Contributor Author

JohnPaton commented Mar 27, 2019

I've just found this update which I didn't previously know about: tqdm/tqdm@97a9393#diff-bf59de82e6ce121b5213bacf25304eb2

Looks like tqdm started supporting automatically doing the nice notebook progress bar display if a notebook is detected, you just need to import the function from a different submodule. There are some gifs of the nicer version here.
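As a rough sketch of what "import the function from a different submodule" could mean in practice, the helper below picks between `tqdm.autonotebook` and plain `tqdm`. `load_tqdm_constructor` and its `prefer_notebook` flag are hypothetical names for this example, and the code assumes that `tqdm.autonotebook` falls back to the console bar outside a notebook.

```python
def load_tqdm_constructor(prefer_notebook=True):
    """Return a tqdm constructor, or None if tqdm is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        if prefer_notebook:
            # Chooses the notebook widget when a notebook is detected and
            # the console progress bar otherwise.
            from tqdm.autonotebook import tqdm
        else:
            from tqdm import tqdm
    except ImportError:
        return None
    return tqdm


constructor = load_tqdm_constructor()
```

Keeping the import inside a function means the choice can be deferred until a download actually starts.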

What do you think @tswast, worth an update? I think it looks pretty good actually. Difficult to test though 🙈

@tswast
Contributor

tswast commented Mar 27, 2019

The fact that the autonotebook module always writes a TqdmExperimentalWarning bugs me a bit. What if we added a progress_bar_constructor argument to RowIterator.to_dataframe (rather than a property on RowIterator) that defaults to tqdm.autonotebook if tqdm is installed, but is settable to something else? None (to turn it off for now). Other valid options include 'tqdm', 'tqdm_notebook', 'tqdm_gui'.

Edit: No reason for progress_bar_constructor to be a property to RowIterator, as it only affects to_dataframe.

Edit 2: Probably progress_bar_constructor should actually be progress_bar_type and be a string. That way it can default to tqdm and fallback to doing nothing if tqdm isn't installed.

Edit 3: Let's default to None (no progress bar) and then switch it to 'tqdm' by default in a subsequent PR that updates the magics. (I can handle that if you'd prefer, John.)

Ignore below

In the %%bigquery magic code in the magics.py module, we add a Context.progress_bar_constructor property so that tqdm can be turned off for testing notebooks.

result = query_job.to_dataframe()

becomes

rows = query_job.result()
rows.progress_bar_constructor = context.progress_bar_constructor
result = rows.to_dataframe()

Contributor

@tswast tswast left a comment

Last thing, sorry for all the edit noise.

Oh, and there are some lint and coverage problems in the CI build. I can handle those if you'd prefer. I know it can be a bit of a pain to set up nox / our linter tool and everything.


# report progress if tqdm installed
progress_bar = None
if tqdm is not None:
Contributor

Let's add a progress_bar_type (string) parameter to to_dataframe and check that it also is not None.

It should default to None for now, until we update the magics module to support turning it off in the Context and also update the magics to use the tqdm_notebook option instead.

Contributor

> It should default to None for now.

Actually, I think defaulting to 'tqdm' is fine. When @alixhami tests notebooks, he can set google.cloud.bigquery.table.tqdm = None like you do in your tests.

Contributor

@alixhami alixhami Mar 27, 2019

I think it should default to None because otherwise it will throw errors for users who don't have tqdm installed who aren't using the parameter.

Contributor

Fixed in the latest commits.

@JohnPaton
Contributor Author

JohnPaton commented Mar 28, 2019

Okay, so to summarize, we'll add an optional parameter progress_bar_type to RowIterator.to_dataframe, which will be None by default, but will also accept "tqdm", "tqdm_notebook", "tqdm_auto", "tqdm_gui".
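The agreed design can be sketched as a string-to-constructor lookup. This is an illustration under stated assumptions, not the merged implementation: `get_progress_bar` is a hypothetical helper, only the three names listed earlier by @tswast are accepted, each name is assumed to be an attribute of the `tqdm` package, and raising `ValueError` for unknown names is a choice made for this example.

```python
import warnings

try:
    import tqdm  # optional dependency
except ImportError:
    tqdm = None

# Names taken from the discussion; each is assumed to exist on the tqdm package.
_PROGRESS_BAR_TYPES = ("tqdm", "tqdm_notebook", "tqdm_gui")


def get_progress_bar(progress_bar_type=None, total=None):
    """Return a progress bar for the given type name, or None.

    Hypothetical helper for illustration only.
    """
    if progress_bar_type is None:
        # Default: no progress bar, so users without tqdm are unaffected.
        return None
    if progress_bar_type not in _PROGRESS_BAR_TYPES:
        raise ValueError("Unknown progress_bar_type: {!r}".format(progress_bar_type))
    if tqdm is None:
        warnings.warn(
            "tqdm is not installed; no progress bar will be shown.", UserWarning
        )
        return None
    constructor = getattr(tqdm, progress_bar_type)
    return constructor(desc="Downloading", total=total, unit="rows")
```

With `None` as the default, callers who never pass the parameter see no change in behaviour, which addresses the concern raised about users who do not have tqdm installed.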

I indeed don't have the linter set up; nox was giving me errors, so I've just been running the relevant tests with pytest. If you could handle the linter, that would be great.

@tswast
Contributor

tswast commented Mar 28, 2019

Good summary. Yes, I can handle the lint and coverage issues.

@JohnPaton JohnPaton requested a review from a team March 28, 2019 20:42
@tswast tswast added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 28, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 28, 2019
@tswast tswast added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 28, 2019
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 28, 2019
@tswast tswast merged commit 800a6bb into googleapis:master Mar 28, 2019
@tswast
Contributor

tswast commented Mar 28, 2019

Thanks a bunch for the contribution @JohnPaton. I just merged this PR; it should be available in the next release of the google-cloud-bigquery library.

@JohnPaton
Contributor Author

Awesome! Happy to help :)
