[BUG] long_callback short interval makes job run in loop #1769

Closed
usr-ein opened this issue Sep 21, 2021 · 13 comments · Fixed by #2039

@usr-ein

usr-ein commented Sep 21, 2021

I'm really excited to use long_callback because it helps a lot with my use cases, but while reading the docs (Example 5) I noticed an undocumented parameter, interval, which is set to 1000 in that example.

I tried setting it to 1000 too, but the progress updates were too slow, so I progressively lowered it. It behaved as I wanted at 100 (ms), but with interval=100, as soon as the job finishes it is immediately executed again, even though the launch button's event was not triggered.

Minimal replication code

#!/usr/bin/env python3
import time
import dash
import dash_html_components as html
import dash_core_components as dcc
from dash.long_callback import DiskcacheLongCallbackManager
from dash.dependencies import Input, Output
import plotly.graph_objects as go

## Diskcache
import diskcache
cache = diskcache.Cache("./cache")
long_callback_manager = DiskcacheLongCallbackManager(cache)

def make_progress_graph(progress, total):
    progress_graph = (
        go.Figure(data=[go.Bar(x=[progress])])
        .update_xaxes(range=[0, total])
        .update_yaxes(
            showticklabels=False,
        )
        .update_layout(height=100, margin=dict(t=20, b=40))
    )

    return progress_graph


app = dash.Dash(__name__, long_callback_manager=long_callback_manager)

app.layout = html.Div(
    [
        html.Div(
            [
                html.P(id="paragraph_id", children=["Button not clicked"]),
                dcc.Graph(id="progress_bar_graph", figure=make_progress_graph(0, 10)),
            ]
        ),
        html.Button(id="button_id", children="Run Job!"),
        html.Button(id="cancel_button_id", children="Cancel Running Job!"),
        html.Span(id="heavy-data-span", children="", style={"display": "none"}),
        html.Span(id="output-span", children="", style={"display": "none"}),
    ]
)

@app.long_callback(
    output=[Output("paragraph_id", "children"),Output("output-span", "children"),],
    inputs=Input("button_id", "n_clicks"),
    running=[
        (Output("button_id", "disabled"), True, False),
        (Output("cancel_button_id", "disabled"), False, True),
        (
            Output("paragraph_id", "style"),
            {"visibility": "hidden"},
            {"visibility": "visible"},
        ),
        (
            Output("progress_bar_graph", "style"),
            {"visibility": "visible"},
            {"visibility": "hidden"},
        ),
    ],
    cancel=[Input("cancel_button_id", "n_clicks")],
    progress=[Output("progress_bar_graph", "figure"), Output("heavy-data-span", "children")],
    progress_default=(make_progress_graph(0, 20), "A"),
    interval=38,
    prevent_initial_call=True,
)
def callback(set_progress, n_clicks):
    total = 20

    for i in range(total):
        print(f"Running {i}/{total}...")
        time.sleep(0.3)
        set_progress((make_progress_graph(i, total), str(i%10)*4_000_000))

    return [f"Clicked {n_clicks} times", "A"*4_000_000]


if __name__ == "__main__":
    app.run_server(debug=True)

If you can't replicate this exact looping bug on your setup, try reducing the interval value until you see the looping behaviour along with some diskcache I/O errors; once you do, progressively raise it until the looping remains but the errors disappear.

Here is what such diskcache I/O errors look like, though in my real application I never witnessed these. There simply were no warnings!

Traceback (most recent call last):
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/dash/long_callback/managers/diskcache_manager.py", line 143, in job_fn
    user_callback_output = fn(*maybe_progress, *user_callback_args)
  File "/Users/sam1902/Downloads/test_dash_bug.py", line 78, in callback
    set_progress(make_progress_graph(i, total))
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/dash/long_callback/managers/diskcache_manager.py", line 137, in _set_progress
    cache.set(progress_key, progress_value)
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/diskcache/core.py", line 799, in set
    with self._transact(retry, filename) as (sql, cleanup):
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/contextlib.py", line 117, in __enter__
 58         ),
    return next(self.gen)
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/diskcache/core.py", line 713, in _transact
    sql = self._sql
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/diskcache/core.py", line 651, in _sql
    return self._con.execute
  File "/Users/sam1902/.pyenv/versions/3.9.1/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/diskcache/core.py", line 626, in _con
    con = self._local.con = sqlite3.connect(
sqlite3.OperationalError: disk I/O error

These are not critical errors that crash the server; they simply end one of the processes, but the callback still loops back. Also, they can happen anywhere within the loop (at any iteration, not just at the end).

Environment

pip list | grep dash:

dash                      2.0.0
dash-bootstrap-components 0.13.0
dash-core-components      2.0.0
dash-html-components      2.0.0
dash-table                5.0.0
diskcache                 5.2.1
psutil                    5.8.0
multiprocess              0.70.12.2

    • OS: macOS 11.5.2
    • Browser: replicated on
      • Firefox 92.0 (i.e. latest on 2021-09-21)
      • Chromium 93.0.4577.82 (i.e. latest on 2021-09-21)

Describe the bug

When the interval is set too small, the callback runs, but once it finishes it runs again, and again, in a loop.
From my replication attempts, this seems to be caused by passing values that are too large to the long_callback's Output or progress=[Output(..)], either during progress updates or at the end of the callback.

I ran into this in my own code because I deal with large images encoded as data URIs; their size is approximated here by "A"*4_000_000, which weighs around 3.8 MB, comparable to my images.

Expected behavior

The callback should only be called once, regardless of the progress refresh frequency (i.e. the interval time).

Screenshots
As you can see, it loops from Running 19/20 back to Running 0/20 without any warning or error:
Screenshot 2021-09-21 at 11 52 25

@jonmmease
Contributor

Thanks for the report @sam1902,

I managed to whittle down the issue to the behavior of Interval itself (which long_callback uses under the hood).

from dash import dcc, html, Dash, Input, Output
from time import sleep

# App gets stuck if interval_s <= callback_sleep_s
interval_s = 1
callback_sleep_s = 2

app = Dash(__name__)
app.layout = html.Div([
    dcc.Interval(id="interval", disabled=False, interval=interval_s * 1000),
    html.Div(id="n_intervals_display")
])

@app.callback(
    Output("interval", "disabled"),
    Input('interval', 'n_intervals'),
)
def update_output(n_intervals):
    sleep(callback_sleep_s)
    return True


@app.callback(
    Output("n_intervals_display", "children"),
    Input('interval', 'n_intervals'),
)
def update_interval(n_intervals):
    return f"n_intervals: {n_intervals}"


if __name__ == '__main__':
    app.run_server(port=8051)

It seems that it's impossible to disable an interval from a callback whose duration exceeds the interval duration.

In the long callback case, callback_sleep_s doesn't represent the runtime of the user callback function. It represents the overhead of launching the background process and checking on the results (and probably the serialization time as well).

@alexcjohnson @chriddyp is this the expected (and desirable) behavior of the renderer?

@usr-ein
Author

usr-ein commented Sep 22, 2021

I just tried setting interval=120_000 to test if it would indeed alleviate the issue, but my job still gets run again once it finishes, instead of outputting. (my bad, didn't sync with the server)

After looking through long_callback's def wrapper(): ... def callback(): ... closure, I think this bug is in fact a race condition, where the dash.py code does something like the following:

  1. Compute pending_key ( == 53c...)
  2. Get current_key (it is None)
  3. Check if the results are ready (they are!)
  4. Fetch the results (very slow)
  5. Set current_key == pending_key
  6. Return the output and updated user_store_data (very slow)
  7. On the next call, we should still have current_key == pending_key, but lo and behold, current_key is still None!!

I arrived at that scenario after a lot of ugly logging in my custom fork.

Here is pseudo-code of what goes on:

def callback(user_store_data, args):
    pending_key = cache.build_key(args)
    current_key = user_store_data["current_key"]
    
    should_cancel = (pending_key == current_key)
    ## Call Number 2:
    # current_key   == None
    # pending_key   == 52c...
    # should_cancel == False
    
    if should_cancel:
        return cancel()
    
    if results_ready(pending_key):
        ## Call Number 1:
        # current_key   == None
        # pending_key   == 52c...
        # should_cancel == False
        
        results = get_results()
        
        # Set this so that next interval call, we have
        # current_key == pending_key and therefore cancel happens.
        # But the thing is, it never does happen ....
        user_store_data["current_key"] = pending_key
        
        return dict(
            results=results,
            interval_disabled=False,
            user_store_data=user_store_data,
        )

I think the only solution is to not rely on setting current_key = pending_key and instead terminate the job and the interval right there. I tried writing a small fix that would replace interval_disabled = pending_job is None with

interval_disabled = True
user_store_data["current_key"] = None
user_store_data["pending_key"] = None
user_store_data["pending_job"] = None
callback_manager.terminate_job(pending_job)

but this didn't help either.

I think that returning a large amount of data makes the callback return chain slow, which lets the interval fire again in that time frame. Since the first return chain has not finished yet, the second call still sees current_key = None, and because it returns after the first call, it overwrites the current_key value in user_callback_output, so every subsequent callback() invocation sees it as None.

I'm not familiar enough with Dash's internals to restructure how long_callback relies on return dict(interval_disabled=True, ...) to actually disable the interval. I feel like there should be a way to disable the interval without returning, e.g. by calling a method or something independent of the thread where this callback() function runs.

@jonmmease
Contributor

Thanks for the detailed debugging @sam1902!

Next call, we should still have current_key == pending_key, but lo and behold, we still have current_key is None !!

I think this has the same root cause as the example I posted above (#1769 (comment)). In both cases, it looks like the updates returned by the callback are ignored if another interval is scheduled during the execution of the callback. In this example, the update was to disable the interval. In the long callback case, the update is to update the Store component with the new current_key value.

@usr-ein
Author

usr-ein commented Sep 24, 2021

I think so too, but the reason another interval is scheduled during the execution of the callback is that fetching the results of my callbacks takes too long (because of the I/O size).
I fail to see any satisfying solution, as I can't really reduce the I/O size to make the callback any faster.

Earlier I mentioned that setting interval=12_000 didn't fix my issue, but it in fact did, so I think your reproduction code is perfect.

The callback_sleep_s is proportional to my callback's output size, so to make the bug disappear I need to set interval to something large enough to handle my biggest output. interval=3_000 might work when I return a small image, but the callback will loop if I try to output a 4K image, for example. So far my workaround is to find the interval value that makes the bug disappear for my largest possible output size and set it to that. The only drawback is that my progress doesn't get refreshed as often since the interval fires slowly, but for now I can live with that.
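
For reference, a minimal sketch of that workaround applied to the repro above (the 10-second value is just a hypothetical guess, sized to cover the slowest result fetch I expect):

@app.long_callback(
    output=[Output("paragraph_id", "children"), Output("output-span", "children")],
    inputs=Input("button_id", "n_clicks"),
    progress=[Output("progress_bar_graph", "figure"), Output("heavy-data-span", "children")],
    # interval must exceed the worst-case time needed to serialize/fetch the largest output,
    # otherwise the callback loops; the exact value is found by trial and error.
    interval=10_000,
    prevent_initial_call=True,
)
def callback(set_progress, n_clicks):
    ...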

Ideally, long_callback should have two different mechanisms to:

  1. Return the output of the job / fetch the progress of the job
  2. Stop the interval timer

These two mechanisms should be decoupled so that the interval timer can be stopped without waiting on the output.

@vsisl

vsisl commented Oct 1, 2021

I believe this bug is an effect of a more general problem related directly to the dcc.Interval component. If dcc.Interval triggers a callback whose duration is longer than the interval property of the dcc.Interval component, the callback will never be fully executed.

Furthermore, if this callback is supposed to disable the dcc.Interval under certain circumstances, the dcc.Interval will never be disabled and the app will get stuck in a loop. It has been discussed here.

@amarvin

amarvin commented Oct 18, 2021

I've got a long_callback that also seems to be running multiple times. The callback ends with a dcc.send_file, which can be large and thus Celery may take some time to retrieve the result from Redis. The progress bar goes like

0% > 50% > 90% > 10% > 40% > 95% > 20% > 60%

then my web browser receives three identical files 😆. I wish it only sent one file and that the final progress was 100% after the long_callback finished...

I'm very excited to switch to long_callback, so I don't need to manually manage each long callback (e.g. define a callback to create the uuid and change dcc.Interval frequency, define a callback to check progress via dcc.Interval and return results, and define a Celery task). I have more control doing it that way, but there's so much boilerplate.
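
For anyone curious, a rough sketch of that manual pattern (all ids, the task body and the Celery/Redis setup below are hypothetical placeholders, not my actual code; here the interval is simply enabled/disabled rather than its frequency changed):

import time
from celery import Celery
from celery.result import AsyncResult
from dash import Dash, dcc, html, Input, Output, no_update

celery_app = Celery(__name__, broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")

@celery_app.task(bind=True)
def long_job(self):
    total = 20
    for i in range(total):
        time.sleep(0.3)  # placeholder for real work
        self.update_state(state="PROGRESS", meta={"done": i, "total": total})
    return "finished"

app = Dash(__name__)
app.layout = html.Div([
    html.Button("Run", id="run"),
    dcc.Store(id="task-id"),                 # holds the Celery task id (the uuid)
    dcc.Interval(id="poll", interval=1000),  # polls the task state
    html.Div(id="status"),
])

# Callback 1: launch the Celery task and store its id.
@app.callback(Output("task-id", "data"), Input("run", "n_clicks"),
              prevent_initial_call=True)
def start(n_clicks):
    return long_job.delay().id

# Callback 2: on every interval tick, check the task and show progress or the result,
# disabling the interval once the job is done.
@app.callback(
    Output("status", "children"),
    Output("poll", "disabled"),
    Input("poll", "n_intervals"),
    Input("task-id", "data"),
)
def poll(n_intervals, task_id):
    if not task_id:
        return no_update, no_update
    res = AsyncResult(task_id, app=celery_app)
    if res.state == "PROGRESS":
        meta = res.info or {}
        return f"Running {meta.get('done')}/{meta.get('total')}", False
    if res.state == "SUCCESS":
        return f"Done: {res.result}", True
    return f"State: {res.state}", False

if __name__ == "__main__":
    app.run_server(debug=True)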

@usr-ein
Author

usr-ein commented Oct 18, 2021

My app was relying heavily on long_callback, and because of this bug and also other minor inconveniences related to using Dash, I had to find an alternative solution. Therefore, I steered away from using Dash altogether and rewrote the whole app in React, FastAPI, and Celery, and used Socket.io to send live progress updates from the server to the client without relying on a recurring timer like dcc.Interval or setTimeout (in JS).

Here is the tutorial that mainly inspired me, but I added the whole Socket.io layer so that I could get the updates client-side without having to explicitly ask the server for them. Whenever it has progress to report, the Celery task makes an HTTP request to the FastAPI backend, which forwards it to the appropriate client over Socket.io, so the Celery task can report any progress update (even large pictures) to the client.
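
Roughly, the relay looks like this (the endpoint, event names and client-id plumbing below are just a hypothetical sketch, not my actual code):

import requests
import socketio
from celery import Celery
from fastapi import FastAPI

celery_app = Celery(__name__, broker="redis://localhost:6379/0")

@celery_app.task
def long_job(client_sid: str):
    total = 20
    for i in range(total):
        # ... do one chunk of work, then report progress over plain HTTP
        requests.post("http://localhost:8000/progress",
                      json={"sid": client_sid, "done": i, "total": total})

# FastAPI backend wrapped in a Socket.IO ASGI app (run with: uvicorn module:app)
sio = socketio.AsyncServer(async_mode="asgi", cors_allowed_origins="*")
api = FastAPI()
app = socketio.ASGIApp(sio, other_asgi_app=api)

@api.post("/progress")
async def progress(payload: dict):
    # forward the update over Socket.IO to the one client that started the job
    await sio.emit("progress", payload, to=payload["sid"])
    return {"ok": True}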

Overall that stack is rather solid, but it required a full rewrite.

Dash could also forgo using a dcc.Interval to periodically "fetch" updates from Celery and instead use a Socket.io (or plain WebSockets) client/server to deliver the updates as they're emitted by the Celery task, but that'd require an ASGI-compatible backend, and idk if Dash works like that behind the curtain.

@amarvin

amarvin commented Oct 20, 2021

My app was relying heavily on long_callback, and because of this bug and also other minor inconveniences related to using Dash, I had to find an alternative solution. [...]

Cool solution, and I wish some of those ideas got incorporated into Dash's long_callback, but I don't want to switch technology stacks just for this feature. I'm trying to keep the team full-stack Python, and Dash is the best option for that.

@mja21

mja21 commented Oct 28, 2021

I'm very excited to switch to long_callback, so I don't need to manually manage each long callback (e.g. define a callback to create the uuid and change dcc.Interval frequency, define a callback to check progress via dcc.Interval and return results, and define a Celery task). I have more control doing it that way, but there's so much boilerplate.

@amarvin how do you determine the frequency of the interval? Could you share an example of your approach?
I am also thinking about manually creating a long callback since my responses can be very small or large depending on the user input.

@sebbegg

sebbegg commented May 18, 2022

Hi there,

I came across the same issue when returning large responses as the result of a long_callback.
I'm not sure I got it all right, but I thought of changing the long_callback logic as follows:

  • Instead of directly returning the result in the interval-triggered callback, the readiness of the result is signaled to another property
  • When the result is ready, the interval is stopped and the user_store_data is reset directly - not through the should_cancel logic in the next invocation
  • A second callback, set up within the long_callback decorator, acts on the change of the result property and fetches the result

This way, the result-fetching part only starts when the interval is already stopped.
I have a locally modified version of this available that seems to be working... though I'm not sure it covers all the corner cases of long_callback.
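
To illustrate the idea in plain-Dash terms (component ids and the helper functions below are hypothetical stand-ins; the real change would live inside the long_callback internals):

from dash import Dash, dcc, html, Input, Output, no_update

def job_is_ready():
    """Hypothetical stand-in for checking against the cache/backend that the job finished."""
    return True

def load_result_from_cache():
    """Hypothetical stand-in for fetching the (potentially huge) result."""
    return "the big payload"

app = Dash(__name__)
app.layout = html.Div([
    dcc.Interval(id="poll", interval=500),
    dcc.Store(id="result-ready"),   # tiny flag, cheap to return
    html.Div(id="result"),          # large payload, fetched separately
])

# 1) The interval-driven callback only signals readiness and stops the interval;
#    it never carries the big result itself, so it returns quickly and the interval
#    is disabled before any slow transfer starts.
@app.callback(
    Output("result-ready", "data"),
    Output("poll", "disabled"),
    Input("poll", "n_intervals"),
    prevent_initial_call=True,
)
def check_ready(n_intervals):
    if job_is_ready():
        return True, True
    return no_update, no_update

# 2) A second callback fires on the readiness flag and fetches the large result,
#    after the interval has already been stopped.
@app.callback(Output("result", "children"), Input("result-ready", "data"),
              prevent_initial_call=True)
def fetch_result(ready):
    if not ready:
        return no_update
    return load_result_from_cache()

if __name__ == "__main__":
    app.run_server(debug=True)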

I'd be willing to make a PR for this, if someone agrees that the concept is reasonable.

@T4rk1n
Contributor

T4rk1n commented May 18, 2022

@sebbegg

That looks like a fine solution, I'll be happy to review if you make a PR.

@vkolevrestec

Hi, is this planned for the next release or is it deeper in the backlog?

@T4rk1n T4rk1n self-assigned this May 31, 2022
@T4rk1n
Contributor

T4rk1n commented May 31, 2022

@vkolevrestec I am working on this; my plan is to remove the interval component and handle the polling in the renderer so there can be only one request at a time.
