Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trim notebook large output for better performance #9594

Merged
merged 12 commits into from
Jan 21, 2021

Conversation

echarles
Copy link
Member

References

Fixes #9593

Code changes

Add a maximumTopBottomOutput notebook setting and trim outputs that are larger than a predefined multiple of that maximumTopBottomOutput value. A relevant message is shown to the user. If the user runs again the cell, he will get the complete output.

User-facing changes

Opening a notebook, the user will be shown with a message for large outputs.

Screenshot 2021-01-11 at 12 17 33

Screenshot 2021-01-11 at 12 26 03

Backwards-incompatible changes

None.

@jupyterlab-dev-mode
Copy link

Thanks for making a pull request to JupyterLab!

To try out this branch on binder, follow this link: Binder

@echarles echarles changed the base branch from master to 2.3.x January 11, 2021 11:51
@echarles echarles changed the title Performance/trim output Trim notebook large output for better performance Jan 11, 2021
@echarles
Copy link
Member Author

@goanpeca @isabela-pf Thx for your reviews and feedbacks.

Copy link
Contributor

@mlucool mlucool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: move this to the body. Putting it in the ctor will help with page load, but won't prevent a notebook slowdown from a large number of DOM nodes accidently being added when a cell outputs too much content.

<div style="margin: 10px"
<pre>Output of this cell has been trimmed on the initial display.</pre>
<pre>Total outputs is ${model.length}, displaying the first ${maximumTopBottomOutput} top and last ${maximumTopBottomOutput} bottom outputs.</pre>
<pre>Run again this cell to get the complete output.</pre>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea is adding button to click to show all output for this cell. Maybe that's v2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about that indeed for a v2 iteration.

for (let i = 0; i < model.length; i++) {
const output = model.get(i);
this._insertOutput(i, output);
const maximumTopBottomOutput = options.maximumTopBottomOutput || 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if options.maximumTopBottomOutput === 0 disabled this feature so it is backward compatible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea! will do

@echarles
Copy link
Member Author

Suggestion: move this to the body. Putting it in the ctor will help with page load, but won't prevent a notebook slowdown from a large number of DOM nodes accidently being added when a cell outputs too much content.

@mlucool Can you elaborate a bit on this. I am not sure to fully understand. Thx

@mlucool
Copy link
Contributor

mlucool commented Jan 11, 2021

@mlucool Can you elaborate a bit on this. I am not sure to fully understand. Thx

My understanding of this change is that when a user opens a notebook, only the first and last N items are shown for a cell. The problem is if you run a cell and it outputs 10k outputs, a user's notebook may still freeze up due to the number of outputs rendered. By moving the logic out of the ctor (e.g. into _insertOutput) it saves the user from a footgun.

@echarles
Copy link
Member Author

My understanding of this change is that when a user opens a notebook, only the first and last N items are shown for a cell. The problem is if you run a cell and it outputs 10k outputs, a user's notebook may still freeze up due to the number of outputs rendered. By moving the logic out of the ctor (e.g. into _insertOutput) it saves the user from a footgun

OK, I see. I was thinking the run as being a workaround for the user to still see the output waiting on the next dev iteration that would show up a button to reveal the unhidden content. I may come up tomorrow with both: moving to _insertOutput have a button (need to test a bit how to implement that and how it potentially affects performance)

@echarles
Copy link
Member Author

echarles commented Jan 12, 2021

@mlucool I have implemented the reveal of the trimmed outputs when the user clicks on the message + the backwards compatibility if the maximumTopBottomOutput is zero (or negative) (see screencast here after.

For the logic move to the insertOutput so that outputs are also trimmed on code execution, I came to read again https://jupyter-client.readthedocs.io/en/stable/messaging.html to make sure we can trigger the end of the execution. I have not (yet) implemented that yet as the output area would need to watch the shell channel (which is does not for now) for a execute_reply message (which it does not for now). This drives me to 2 questions:

  1. As as user, if I request a cell execution, I explicitly want all the outputs (but that is just me, I guess asking to more users will generate a variety of answers).
  2. There is a risk that the execute_reply message does not come through, and that the output area will not show the terminating lines (@Carreau can may be tell more on that risk).

Kapture 2021-01-12 at 14 07 07

@mlucool
Copy link
Contributor

mlucool commented Jan 12, 2021

@mlucool I have implemented the reveal of the trimmed outputs when the user clicks on the message + the backwards compatibility if the maximumTopBottomOutput is zero (or negative) (see screencast here after.

Looks great!

As as user, if I request a cell execution, I explicitly want all the outputs (but that is just me, I guess asking to more users will generate a variety of answers).

I don't think this is always true. If you do x + 1, then certainly show the output. If you run run_work_on_100000_machines() and all 100k return the same error, you really only want the first and last few. In general, at some point too much output is more noise then value (lab's performance will degrade)

There is a risk that the execute_reply message does not come through, and that the output area will not show the terminating lines (@Carreau can may be tell more on that risk).

To be clearer, I am suggesting as data comes in we do something like this:

Message 1->2N output as is (no change)
Message 2N+1: Output: 1...N, Message, N+2...2N+1
Message 2N+2: Output: 1...N, Message, N+3...2N+2
...
Message 3N: Output: 1...N, Message, 2N...3N

That is, as a each new message comes in, start replacing the oldest of the tail after some threshold.

@echarles
Copy link
Member Author

I don't think this is always true. If you do x + 1, then certainly show the output. If you run run_work_on_100000_machines() and all 100k return the same error, you really only want the first and last few. In general, at some point too much output is more noise then value (lab's performance will degrade)

True.

That is, as a each new message comes in, start replacing the oldest of the tail after some threshold.

Sounds like a great idea . The tail will change in live, if the user wants to see the body, he can always click on the message.

Let me give it a try.

@Carreau
Copy link
Contributor

Carreau commented Jan 12, 2021

2. There is a risk that the execute_reply message does not come through, and that the output area will not show the terminating lines (@Carreau can may be tell more on that risk).

Most of the risk is if the network breaks; or if the kernel crash; so I think that should be fine.

Note that there is also:

--NotebookApp.iopub_data_rate_limit=<Float>
    (bytes/sec) Maximum rate at which stream output can be sent on iopub before
    they are limited.
    Default: 1000000
--NotebookApp.iopub_msg_rate_limit=<Float>
    (msgs/sec) Maximum rate at which messages can be sent on iopub before they
    are limited.
    Default: 1000

But those limit the rate to protect against accidental while True loop in teaching context.
Most o those might also be completely fixed with a server side model in the long run.

@echarles
Copy link
Member Author

echarles commented Jan 13, 2021

The tail of the output area is now showing the new outputs in a "stream" way on cell run.

Kapture 2021-01-13 at 16 10 23

@echarles
Copy link
Member Author

FYI I am working to take into account the kernel execute_reply so the above behavior is also applicable to new cells.

@ellisonbg
Copy link
Contributor

A few comments:

  • In the first versions of JupyterLab we explored more aggressive output collapsing and got some pretty strong pushback. I don't have a link to the specific issues but am hesitant to ship something that doesn't address users who are used to how the classic notebook works with large output.
  • Requiring a user to rerun a cell is going to introduce a massive amount of pain. The reason is that long output is often associated with a long running cell. Image how frustrating it would be to run a cell overnight, not see the output you need to see, and having to rerun the cell again. While I am open the this type of approach overall, I think we have to have an approach that allows the user to perform a simple action (click a button) to instantly view all the output.
  • Because output can be a mix of visual and textual, I think it is important for us to work hard on the visual treatment of this UI.
  • Not sure 10 is large enough. Do we have any performance data to help guide our decision on the right cut point?
  • How is the limit of 10 calculated? What counts towards that limit?

@echarles
Copy link
Member Author

echarles commented Jan 13, 2021

In the first versions of JupyterLab we explored more aggressive output collapsing and got some pretty strong pushback. I don't have a link to the specific issues but am hesitant to ship something that doesn't address users who are used to how the classic notebook works with large output.

If this is a show-stopper, we can set the setting maximumTopBottomOutput to 0 to ensure backwards compabitility

Requiring a user to rerun a cell is going to introduce a massive amount of pain. The reason is that long output is often associated with a long running cell. Image how frustrating it would be to run a cell overnight, not see the output you need to see, and having to rerun the cell again. While I am open the this type of approach overall, I think we have to have an approach that allows the user to perform a simple action (click a button) to instantly view all the output.

User does not need to rerun to see the output, he just has to click on the message to see the trimmed outpus.

Because output can be a mix of visual and textual, I think it is important for us to work hard on the visual treatment of this UI.

This aims to address outputs with many top divs (eg. more than 100 ), so even the cell has multiple visual outputs, you would not get those 100 visuals.

Not sure 10 is large enough. Do we have any performance data to help guide our decision on the right cut point?

We have benchmarks in https://github.com/jupyterlab/benchmarks to generate such figures, but this would take quite some work to run them and extract them details. 10 sounds like a reasonable number. We could go to 20 to be sure we don't strip out most of the cells. This is also configurable and the goal is to update that number based on the subsequent RC releases.

How is the limit of 10 calculated? What counts towards that limit?

See previous answer.

@echarles
Copy link
Member Author

BTW I need to update the shown message as it says users has to rerun to see the outputs: User has just to click to see the outputs.

@mlucool
Copy link
Contributor

mlucool commented Jan 13, 2021

I think maximumTopBottomOutput is hard to understand as setting this to 10 give you 20 outputs. It may be clearer to name this maxNumberOutputs and use maximumTopBottomOutput/2 on each side of the message?

@echarles
Copy link
Member Author

@mlucool I have changed the setting name to maxNumberOutputs. While ensuring the trimmed output was working fine for a new cell listening to the execute_reply message, I came across the following 2 issues.

First, let's consider the number of output messages will be 17 and that it is the first time the user runs the cell,.

  • In that case, the model is not yet populated, and we also don't know in advance it will be 17.
  • When the cell is run, the kernel message are coming in. We can monitor and say that after the 10th, we show the clickable info message, and only show the last message to show the activity (see previous screencast).
  • Then we receive the 15th message and with the execute_reply kernel message, we know the execution is over. In that case, we would need to remove the clickable info message and display the 7 last outputs.

As a user experience, this is not great as we show and hide directly the clickable info message.

There is also a second issue. When listening on the execute_reply kernel message, the output area is still inserting widgets after the reception of that kernel message (see the following example in case of 1000 outputs). It is difficult/impossible to exactly know when the latest message is received.

widget.ts:458 ---- 873
widget.ts:458 ---- 874
widget.ts:458 widget.ts:636 --- {header: {…}, msg_id: "5043dda5-97048cdeaef6c9d9327788ff_3044", msg_type: "execute_reply", parent_header: {…}, metadata: {…}, …}
widget.ts:458 ---- 876
widget.ts:458 ---- 877
widget.ts:458 ---- 878
widget.ts:458 ---- 879

Unless we find a way to overcome the above 2 issues, I would suggest to revert back to the initial approach where we only show the message on already populated cells. When the user will open the notebook, we will get performance gain as we can apply the trimmed logic.

The downside is that the user will still see large outputs on runCellor runAllCells, but in terms of performance for the notebook rendering, I don't see real issue as the notebook is already loaded. We only update the DOM for part of the cell outputs, which is OK for performance.

WDYT?

@mlucool
Copy link
Contributor

mlucool commented Jan 14, 2021

I don't see real issue as the notebook is already loaded.

I opened the notebook in your second screenshot in the first update (i.e. the one that says NameError) in both Lab and Notebook. Even after the pageload, notebook says somewhat usable and I can rerun/clear a cell. Lab basically becomes not responsive to most actions. While I don't have hard data on this, I have a hunch this is due to the CSS/wrapping each in lumino widgets. This means that making this PR work for both load and runtime is pretty important to making lab stable. This state isn't hard to get into if you have a cell that does something highly parallelized and they all have errors.

To address the two issues you raised:
Assume N=20 and we show N/10 before/after. I think the problem is you aren't always showing the last N/2. That is, this should always have the correct output even if you don't know if more messages will come. So in the case described 1-17 should be rendered. This holds for all outputs 1-20. At output 21, you'd see 1-10, msg, 11-21. Then at output 22 it would show 1-10, msg, 12-22. There is never a time when this would need future information to display correctly.

This means that the clickable info is never removed because it is only added after we are sure it will be needed. This should also remember if that if a user clicked show all output when output 21 came, then output 22 comes, we remember not to hide anything.

@ellisonbg
Copy link
Contributor

ellisonbg commented Jan 14, 2021 via email

@goanpeca
Copy link
Member

  • Are we storing the state related to this in the notebook metadata?

Nope.

What
happens if a user reopens a notebook?

By default the output cells will be hidden unless the users clicks the message.

  • How does this state interact with the other state (scrolling, collapsed)
    that determines how outputs are rendered and handled?

Good question, not sure!

@echarles
Copy link
Member Author

How does this state interact with the other state (scrolling, collapsed) that determines how outputs are rendered and handled?

It the cell output is collapsed, the user can ucollapse it. Depending on the output length, the cell will be shown with or without the trimmed outputs. The some outputs would have been trimmed, the user can still then click on the message and the them like before.

For the scroll behavior, the trimmed outputs are not show and not in the DOM, so they will not be impacted. Only on reveal, they would be.

@echarles
Copy link
Member Author

@goanpeca Thx a lot for your review. I have pushed the needed changes.

@ellisonbg Thx for the questions. Anything else we need to provide? I propose to leave this PR open for a few days (this Firday and weekend, so until Monday 18 Jan included). We'd like then if no more concern is raised, to get this merged to release a JupyterLab 2.3.0rc0 version with this in as discussed during the last weekly JupyterLab meeting.

@echarles
Copy link
Member Author

@goanpeca This PR is ready for final review (lint is fixes, the other notebook js failure is unrelated to these changes and linked to flaky tests previously seen on the 2.x branches)

Copy link
Member

@goanpeca goanpeca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this work @echarles !

All works as expected

@echarles
Copy link
Member Author

echarles commented Jan 20, 2021

@ellisonbg From today community meeting, I understand you may have improvement proposals to make sure the message is seen by the user (or via CSS or via right-click menu) and you also said that it should prolly not block the merge as we can iterate on the message quality in the next Release Candidates.

We have planned resources to release tomorrow 2.3.0RC0. As discussed, it will be much appreciated to get any feedback (merge as it or not). As option, we can set the setting maxNumberOutputs to 0 that would ensure backwards compatibility (all outputs will be show, no message to the user) but we would prefer having the value set to a non-zero value. We could maybe go to 100 instead of 20 so that less cells fall into the trimmed output case.

@ellisonbg
Copy link
Contributor

I am supportive of this being merged in its current state (pending any code reviews) with a maxNumberOutputs set to a non-zero value (maybe a compromise of 50). Longer term I think we should explore how we can make this more usable:

  • Improved styling of the message.
  • Context menu items on the output and notebook to show all the outputs.
  • A indicator on the output prompt area to show the user there is more output that is being hidden so that can tell this without scrolling as much.

Nice work though!

@echarles
Copy link
Member Author

Thx @ellisonbg for your intake and as usual great inputs, reviews and feedbacks to make jupyterlab a greater piece of software for the users.

@isabela-pf At some point, we can get advice from you on the actions listed by Brian: Improved styling of the message and/or Context menu items on the output and notebook to show all the outputs and/or A indicator on the output prompt area to show the user there is more output that is being hidden so that can tell this without scrolling as much... andor anything you think would make a great user-experience.

@echarles
Copy link
Member Author

maxNumberOutputs is now set to 50. I have opened #9652 for the user interaction enhancement.

@goanpeca
Copy link
Member

Thanks for this amazing work @echarles and all reviewers for their comments. Merging now.

Will cut a release with @blink1073 in a couple of hours!

Cheers

@goanpeca goanpeca merged commit 2aafad4 into jupyterlab:2.3.x Jan 21, 2021
@isabela-pf
Copy link
Contributor

Just to ping everyone here who might be interested, I commented on #9593 since this PR has been merged.

@uribe-convers
Copy link

Hi @goanpeca and @mlucool,

this feature is very helpful but for the work I do, I need to turn it off. I manage multiple users and would prefer to set the default to maxNumberOutputs: 0 as the users build the docker container with Jupyter.

Is there a way I can do set this default value using the jupyter_notebook_config.py file? I don't see it in the options available in the docs.

Thanks fr the help!
Simon

@mlucool
Copy link
Contributor

mlucool commented Apr 7, 2021

All defaults should be overridable with overrides.json. I too learned about this great feature only recently!

@uribe-convers
Copy link

Thanks @mlucool, that's a cool way of setting user preferences! However, I haven't been able to get it working...
I created a settings dir and overrides.json file in my environment /root/home/my-user/miniconda3/envs/env/share/jupyter/lab/settings/overrides.json with the text: {"maxNumberOutputs": 0}.

I restarted the Jupyter server and refreshed the browser tab but still don't see the changes. Is there anything else I need to do?
Thank you!

@mlucool
Copy link
Contributor

mlucool commented Apr 7, 2021

I didn't test this, but you'll want something like this (you need to scope the settings):

    "@jupyterlab/notebook-extension:tracker": {
         "maxNumberOutputs": 0
     }

@uribe-convers
Copy link

Thanks @mlucool, I appreciate the help! I've tried to put the code below in /root/home/my-user/miniconda3/envs/env/share/jupyter/lab/settings/overrides.json and in /root/home/.jupyter/lab/settings/overrides.json and nothing changes, but I'll keep trying! :)

{
  "@jupyterlab/notebook-extension:tracker": {
    "maxNumberOutputs": 10
    }
}

@jasongrout
Copy link
Contributor

Thanks @mlucool, I appreciate the help! I've tried to put the code below in /root/home/my-user/miniconda3/envs/env/share/jupyter/lab/settings/overrides.json and in /root/home/.jupyter/lab/settings/overrides.json and nothing changes, but I'll keep trying! :)

I would suggest trying the example in the docs (changing the default theme) first to make sure that you have the file location correct first, then working on getting the right setting name.

@uribe-convers
Copy link

uribe-convers commented Apr 8, 2021

yeah, totally @jasongrout—I've tried that too. Thanks

@github-actions github-actions bot added the status:resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion. label Oct 6, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pkg:cells pkg:notebook pkg:outputarea status:resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trim large notebook outputs for better performance
8 participants