3.x Task Runs went missing from 3.0.0rc18 to 3.0.0rc19 #15153
Hi @tothandor, thanks for the bug report! Could you confirm two things for me:
Thank you!
Hi @cicdw , I was trying to reproduce the missing task runs with a simplified flow, but all I could achieve was an empty "run graph". This is the flow code so far.

```python
# dbflow/dbflow_class.py
import logging

import pandas as pd
import sqlalchemy as sqla
from prefect import flow, task
from prefect_sqlalchemy import SqlAlchemyConnector

log = logging.getLogger()


class DBFlow:
    @task
    def dbconnect(self):
        log.info("Connecting to database")
        self.db = SqlAlchemyConnector.load("dbconn").get_engine()

    @staticmethod
    @task
    def q1(db):
        log.info("QUERY 1")
        df = pd.read_sql('select * from cities limit 10', db)
        return df

    @task
    def q2(self):
        log.info("QUERY 2")
        df = pd.read_sql('select * from forecast limit 10', self.db)
        return df

    @task
    def l2(self, df: pd.DataFrame):
        log.info("LOADING TO DATABASE")
        df.to_sql('forecast_test', self.db, schema='test', if_exists='replace')

    @flow(name='DBFlow')
    def run(self=None):
        self.dbconnect()
        self.cities_ = self.q1(self.db)
        self.forecast_ = self.q2()
        self.l2(self.forecast_)


dbflow = DBFlow()
dbflow_flow = dbflow.run
```

```yaml
# prefect.yaml
name: prefect3
prefect-version: 3.0.0rc20

# build section allows you to manage and build docker images
build:

# push section allows you to manage if and how this project is uploaded to remote locations
push:

# pull section allows you to provide instructions for cloning this project in remote locations
pull:
- prefect.deployments.steps.set_working_directory:
    directory: /home/andor.toth/work/prefect3

# the deployments section allows you to provide configuration for deploying flows
deployments:
- name: dbflow
  version:
  tags: []
  concurrency_limit: 1
  description:
  entrypoint: dbflow.dbflow_class:dbflow_flow
  parameters: {}
  work_pool:
    name: default
    work_queue_name:
    job_variables: {}
  enforce_parameter_schema: true
  schedules:
  - cron: '*/5 * * * *'
    timezone: Europe/Budapest
    day_or: true
    active: true
    max_active_runs:
    catchup: false
```
I have retried the upgrade from 3.0.0rc18 to 3.0.0rc20 with particular attention to every step. Maybe it's not important, but I get a lot of these errors, both in rc18 and rc20. And in most cases when a flow fails, I get this stack trace.
If I revert to rc18, then everything is fine again.
I have tried some intricate dummy flows, but all I could achieve is an incomplete task run graph. But the task runs appeared correctly in the list.

```python
# dbflow/forkmerge.py
import logging

from prefect import flow, task
# from prefect_sqlalchemy import SqlAlchemyConnector

log = logging.getLogger()


class ForkMergeFlow:
    @task
    def dbconnect(self):
        log.info("Connecting to database")
        self.db = lambda: 'dummyresource'
        # self.db = SqlAlchemyConnector.load("dbconn").get_engine()

    @staticmethod
    @task
    def q1(db):
        log.info("QUERY 1")
        return [1, 2, 3, 4]

    @task
    def q2(self):
        log.info("QUERY 2")
        return list('abcd')

    @classmethod
    @task
    def q3(cls, db):
        log.info("QUERY 3")
        return dict(c=3, d=4, e=5, f=6)

    @task
    def p11(self, data):  # result #1 - processor #1
        log.info('PROCESS 1 RESULT 1')
        return sorted(data)

    @task
    def p21(self, data):  # result #2 - processor #1
        log.info('PROCESS 1 RESULT 2')
        return list(enumerate(data))

    @task
    def p22(self, data):  # result #2 - processor #2
        log.info('PROCESS 2 RESULT 2')
        return list(reversed(data))

    @task
    def m12(self, r1, r2):  # merge process result #1 & #2
        log.info('MERGE RESULT 1 & 2')
        return list(zip(r1, r2))

    @task
    def l1(self, data):
        log.info('LOAD 1')
        print(data)

    @task
    def l2(self, data):
        log.info("LOAD 2")
        print(data)

    @task
    def l22(self, data):
        log.info("LOAD 2 PROCESSED RESULT 2")
        print(data)

    @task
    def lm12(self, data):
        log.info("LOADING MERGED DATA")
        print(data)

    @flow(name='ForkMerge')
    def run(self=None):
        self.dbconnect()
        # raw results
        self.r1_ = self.q1(self.db)
        self.l1(self.r1_)
        self.r2_ = self.q2()
        self.l2(self.r2_)
        self.r2_ = self.q2()  # once more
        self.rp21_ = self.p21(self.r2_)
        self.rp22_ = self.p22(self.r2_)
        self.l22(self.rp22_)
        self.r3_ = self.q3(self.db)
        # processed results
        self.rp11_ = self.p11(self.r1_)
        self.l1(self.rp11_)
        # merged results
        self.rm12_ = self.m12(self.rp11_, self.rp21_)
        # load results
        self.lm12(self.rm12_)


fmflow = ForkMergeFlow()
fmflow_run = fmflow.run
```

The entrypoint for the deployment is
@tothandor thanks for this! A few more questions to help diagnose:
Hi @jakekaplan , I have added PREFECT_DEBUG_MODE=True to the configuration.
The server shows some sqlite3 exceptions:
And there's another regarding database locking.
EventsWorker errors were present in earlier versions too. The API URL is set in the profile config to the following value: I hope this helps!
Thank you @tothandor, that is helpful! Here's a simplified test you can run:

```python
import asyncio

from prefect.events.clients import PrefectEventsClient


async def main():
    async with PrefectEventsClient() as client:
        print(f"Connected to: {client._events_socket_url}")
        pong = await client._websocket.ping()
        pong_time = await pong
        print(f"Response received in: {pong_time}")


if __name__ == '__main__':
    asyncio.run(main())
```

When connecting successfully this gives me:
Is your server traffic whitelisted or proxied in some way? I've seen instances in the past where
Hi! I’m encountering the same issue after upgrading from Prefect 2.x to the new 3.0.0. Even though the flows execute successfully, task runs are not always registered in the dashboard. It’s strange because this issue doesn’t occur consistently; in some runs, tasks are visible in the dashboard. Additionally, the function provided results in the following: Thanks!
Hi @jakekaplan ,
I have since upgraded to 3.0.0, but it doesn't make a difference.
Hey @lucasdepetrisd -- the
hey @tothandor, thanks for trying out the test script. Unfortunately I'm not able to reproduce what you're seeing here, so I may need some more help on your end investigating! Let me give a brief overview of what's happening and hopefully we can piece things together.
Client side:
Server Side:
I know this is a lot of information, but I'm hoping you may be able to run a no-op task, track the events from client to server, and, following the above, see if you notice anything out of the ordinary. Happy to answer any other questions as well. Thanks in advance for your assistance in working through this!
Thanks for the clarification @cicdw! It makes sense that the
Hey @jakekaplan, thanks for the explanation! I ran a basic flow with a no-op task and checked the database and logs as you suggested. Here’s what I found in the logs about the task:
It seems all is as expected on the client side, with no signs of an error except for that ValueError @cicdw explained before. However, on the server side, I queried the database and found: For
But for Additionally, I observed that when restarting the local server, tasks are briefly stored and then stop registering after a few minutes. Any ideas on why this might be happening?
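For anyone repeating this kind of database check against a SQLite-backed server, here's a minimal, self-contained sketch of the sort of query involved. The table name, column names, and event strings below are illustrative guesses, not Prefect's actual schema:

```python
import os
import sqlite3
import tempfile

# Build a throwaway database mimicking a guessed-at events table, then
# run the sort of query used to look for task-run state events.
# Table/column/event names are assumptions for illustration only.
path = os.path.join(tempfile.mkdtemp(), "prefect.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE events (occurred TEXT, event TEXT, resource_id TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-09-10T12:00:00", "prefect.task-run.Running", "prefect.task-run.abc"),
        ("2024-09-10T12:00:01", "prefect.task-run.Completed", "prefect.task-run.abc"),
    ],
)
rows = con.execute(
    "SELECT event FROM events "
    "WHERE event LIKE 'prefect.task-run.%' ORDER BY occurred"
).fetchall()
print(rows)  # state events recorded for the task run, oldest first
```

If the equivalent query against the real server database returns nothing for a task run you just executed, that points at the events never being inserted server side.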
@lucasdepetrisd thanks for doing that. Can you confirm for me that both your client and server versions are
Yes, I ran the prefect version command and got:
I also checked the server version using the endpoint. However, when querying
Hi @jakekaplan , thanks a lot for the detailed explanation! I have another development server (dev15) that runs Prefect 3 with a somewhat more complex workflow called CountryWeatherFlow, which collects and processes weather data. After failing to upgrade it to rc19, I modelled the dummy ForkMergeFlow on the test server (t84) after it, and did my experimentation and debugging on that simplified flow. Now I have made a new attempt at the upgrade on the dev server (from rc18 to GA), resetting the database, recreating blocks, and redeploying the workflows. Unlike the test server, the dev server uses an Apache 2.4 reverse proxy, so I have adjusted its configuration to handle both HTTP and websockets (with the wstunnel module). I am sorry, but I may have mixed up some of the conclusions from the different servers, and believed that task runs could be reliably checked on the run graph. After all, the blocked websockets caused the problem, as you guessed early on. Thanks for all your time and effort in solving this issue!
Hi @jakekaplan! I upgraded to version 3.0.1, restarted the server, and let it run over the weekend, but the issue persists. Would you prefer I open a new issue, or should we keep this one closed and continue the conversation here or in the new issue #15274?
@lucasdepetrisd I'm going to re-open this issue. The other issue, while showing a similar symptom, seems to stem from the client not being able to open a websocket connection to the server, which is how you would get the
We've added some additional server side debug logging. If I could ask you to enable the following settings server side before restarting your server again for them to kick in:
(also double check you're on
The server debug logs can be pretty noisy as there is a lot going on; if you're having difficulty parsing the logs, you can opt to temporarily disable some of the other services (
@lucasdepetrisd were you able to investigate any further?
If anyone is coming to this issue because task runs are not showing up, please follow these debug steps (consolidating a couple of comments from above). When you execute a task run, it will emit events via websocket to the server. These events are converted and inserted into the
Following the below, you should be able to track a task run from execution time, through client side event emission, server side receipt, and finally db insertion. We're looking for any error logs or indication of where in the process something is breaking down.

Debug steps:
1. Confirm that your client and server are both on 3.0.2 (or whatever the latest version is).
2. Set up debug env vars.
3. Set up a test task. Run the following task (either directly or in a flow, whichever method was leading to you not seeing task runs before):

```python
from prefect import task


@task(log_prints=True)
def hello_task():
    print("Hello, world!")
```

4. Track the task run through the client:
You should see 3 task run state events make their way to the
5. Track the task run through the server. The server will accept the task run events via the websocket and you should see the following debug logs:
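For step 2 above, the env vars might look like the following. PREFECT_DEBUG_MODE appears earlier in this thread; PREFECT_LOGGING_LEVEL is assumed here as the standard way to turn the log level up to debug:

```shell
# Assumed debug settings; export these in the shell that starts the
# client/server so the extra debug logging kicks in.
export PREFECT_DEBUG_MODE=True
export PREFECT_LOGGING_LEVEL=DEBUG
echo "$PREFECT_DEBUG_MODE $PREFECT_LOGGING_LEVEL"
```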
Hi @jakekaplan, you were exactly right. Jobs were not able to communicate with the Prefect server API due to incorrect certificate store configuration (missing CA certificates). I was confused about the failed GlobalEventLoopThread logs seen from the worker pod, especially because I was able to communicate with the Prefect server through a websocket connection when testing from the worker pod; I didn't realize the failed logs were coming from the jobs. I could confirm this by looking at the job logs and finding the same GlobalEventLoopThread failures, and my ad-hoc websocket test also failed when using the jobs container image. The problem got fixed after adding the right certificates into the jobs/flows container image.
Same for me, when I enabled both debugging options the problem disappeared! |
Hi @kavvkon, @jbw-vtl, @lucasdepetrisd. I believe your issue is probably more closely related to #15607 and the follow-up #16100. If your clients are on Windows, please look at these issues.
Afraid we're running within docker compose on Debian.
In your first comment you provided
I had the same problem upgrading from Prefect 2.20.6 to Prefect 3.1.5 in my k8s instance:
Thanks to @zzstoatzz for confirming that outbound websockets were the issue. Updating my IngressRoute fixed it:
Also ran into the websocket-related issue following an upgrade to
We use a fairly basic docker compose setup, with both the agent and server in the same docker network. The agent communicates directly with the server using http://server:4200/api (server being the service name for the server). This is running on Debian Linux. The flows are showing subflow runs within the task_run table and UI, but not the actual task runs. Any help is greatly appreciated; despite recreating all the server and agent containers we can't seem to resolve this one. We have not yet tried resetting the whole database.
I have ruled out the basic connection not working; I can successfully connect from within the agent container to the websocket on the server using something like the below:

```python
import asyncio

import websockets


def test_url(url, data="{}"):
    async def inner():
        print(f"Connecting to {url}")
        async with websockets.connect(url) as websocket:
            print("Connected")
            await websocket.send(data)
            print("Successfully sent data")
    return asyncio.run(inner())


test_url("ws://server:4200/api/events/in")
```

The schema of course is wrong (sending an empty JSON), but the client can connect successfully, with the server failing only when validating the request with pydantic.
Just providing a bit more info: we tried fully resetting the database and recreating it, same issue for 3.1.8.
Hi @jbw-vtl: You'll see the error in your logs is:
Websockets in general are only supported on > In the test you ran you can see the server respond with the correct HTTP version.
Some questions that maybe could help debug:
Previously we were running
We do use nginx as a reverse proxy for SSL exposing the Prefect server; however, we can reproduce the issue with just a server and agent in docker compose, without any further components. Full docker compose below (company-related content in
We run a slightly modified version of the prefect base image, as the VM itself is airgapped, so installing

```dockerfile
# Slightly modified version of the prefect base image to include docker dependencies for an airgapped environment
FROM prefecthq/prefect:3.1.8-python3.12
RUN pip install prefect[docker]==3.1.8 --no-cache-dir
```

Full logs of both server and agent in debug below, following manually triggering a deployment through the UI.
Hmm, upon further inspection, downgrading to 3.1.5 (including downgrading the database revisions) runs into the same websockets errors. At this point I'm not sure what could have changed. Will keep investigating in the coming days to see if I find something sticking out. Job execution etc. is all working as expected.
Thank you all, that info is really helpful. Will try and see if I can reproduce something similar on my end with this. We did add some proxy support for websockets as part of 3.1.7 in this PR, which I thought somehow could be related (although I don't immediately see how, as it doesn't look like you're configuring
After updating to 3.1.9 the problem still persists. Task runs are created but not plotted. And this time changing the logging level to debug does not make it work.
Okay, after a lot more debugging, we did identify the issue for us.

```json
{
  "NO_PROXY": "<company>.com,server,127.0.0.1,0.0.0.0,localhost",
  "HTTP_PROXY": "<proxy>",
  "HTTPS_PROXY": "<proxy>"
}
```
Unsure whether there are helper functions around somewhere to validate whether a host matches a given NO_PROXY entry. Short term we will just remove the
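As a sanity check, Python's standard library exposes its own NO_PROXY matching logic; a small sketch (the hostname `server` and the env value mirror the compose setup above, and this exercises the stdlib's matching rules, not the websockets library's):

```python
import os
import urllib.request

# Mirror the NO_PROXY value from above; urllib reads the lowercase form.
os.environ["no_proxy"] = "company.com,server,127.0.0.1,0.0.0.0,localhost"

# proxy_bypass_environment strips any port before matching, so both the
# bare service name and the host:port form report a bypass here.
print(bool(urllib.request.proxy_bypass_environment("server")))       # True
print(bool(urllib.request.proxy_bypass_environment("server:4200")))  # True
```

If the stdlib says the host should bypass the proxy but the websocket client still routes through it, that points at the client library's own proxy handling rather than the environment.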
Some more logs: Run Client logs
Server logs
@jbw-vtl no worries at all, thank you for the breakdown. That's very helpful. If you could file an issue for supporting
@kavvkon from your logs I can see task runs inserted into the database. Does task run
Yes, I can see the task run. Indeed, there is an error in the graph-v2 request:
@kavvkon can you look at the server logs when you hit
hmm not what I expected
That's not what I would have expected either... Maybe one of the task run timestamps doesn't have a tz? I'm not positive how that's possible. @kavvkon Could you open a separate issue for this? An MRE with a
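If the naive-timestamp guess is right, the failure mode is easy to demonstrate in isolation; a minimal illustration (plain stdlib, not Prefect code):

```python
from datetime import datetime, timezone

# Comparing a tz-aware timestamp with a naive one raises TypeError --
# the kind of error that would surface when the server sorts a mix of
# aware and naive task run timestamps.
aware = datetime(2024, 9, 10, tzinfo=timezone.utc)
naive = datetime(2024, 9, 10)
try:
    aware < naive
except TypeError as err:
    print(err)  # can't compare offset-naive and offset-aware datetimes
```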
Going to close this issue, as the likely primary causes have been addressed and it has become a bit of a catch-all.
If anyone continues to see missing task runs, please follow these debugging steps and open a new issue with an MRE. Thank you!
Bug summary
Up until 3.0.0rc18, task run completions were registered correctly with a local Prefect server, but after that, with 3.0.0rc19 and 3.0.0rc20, the list of task runs remains empty even after all tasks have completed successfully.
So nothing will be listed on a dashboard URL like this:
http://localhost:4200/runs/flow-run/a161f610-630f-43a5-8d22-93845c4659a2?tab=Task+Runs
I am still investigating the issue, but downgrading solved it for me.
The flow that I use loads an sqlalchemy-connector block in the first task and passes the engine to downstream tasks to be used for issuing queries. Maybe this is not a viable method, so I am planning to check this with a simplified flow with dummy tasks.
Version info (prefect version output)
Additional context
I get a lot of errors like this from the worker: