
Infinite Loop Symptoms on Prefect Server #15607

Closed
a14e opened this issue Oct 8, 2024 · 7 comments · Fixed by #16051
Labels: bug (Something isn't working)

Comments


a14e commented Oct 8, 2024

Bug summary

While working locally, I ran into an issue with Prefect that looks like an endless loop on the backend.

I've observed the following behavior locally several times:

  1. After running several dozen manual script executions, the workflows stop displaying correctly in the web UI. Specifically, the green squares on new flows are no longer visible.
  2. At the same time, I see a large stream of logs in the PostgreSQL terminal.
  3. Restarting Prefect resolves the issue.

Logs:

prefect-postgres-1  | 2024-10-07 23:57:37.075 UTC [2893] STATEMENT:  INSERT INTO task_run_state (task_run_id, type, timestamp, name, message, state_details, data, id, created, updated) VALUES ($1::UUID, $2::state_type, $3::TIMESTAMP WITH TIME ZONE, $4::VARCHAR, $5::VARCHAR, $6, $7::JSON, $8::UUID, $9::TIMESTAMP WITH TIME ZONE, $10::TIMESTAMP WITH TIME ZONE) ON CONFLICT (id) DO NOTHING
prefect-postgres-1  | 2024-10-07 23:57:37.082 UTC [2893] ERROR:  duplicate key value violates unique constraint "uq_task_run_state__task_run_id_timestamp_desc"
prefect-postgres-1  | 2024-10-07 23:57:37.082 UTC [2893] DETAIL:  Key (task_run_id, "timestamp")=(3649e07c-9883-42ac-8819-76c821f5f8eb, 2024-10-07 23:45:17.096643+00) already exists.
prefect-postgres-1  | 2024-10-07 23:57:37.082 UTC [2893] STATEMENT:  INSERT INTO task_run_state (task_run_id, type, timestamp, name, message, state_details, data, id, created, updated) VALUES ($1::UUID, $2::state_type, $3::TIMESTAMP WITH TIME ZONE, $4::VARCHAR, $5::VARCHAR, $6, $7::JSON, $8::UUID, $9::TIMESTAMP WITH TIME ZONE, $10::TIMESTAMP WITH TIME ZONE) ON CONFLICT (id) DO NOTHING
....
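The log suggests the insert collides with the unique index on (task_run_id, timestamp), while the statement's ON CONFLICT (id) DO NOTHING clause only guards against collisions on id, so the statement keeps erroring. A quick way to inspect the offending task run (the UUID and column names below are copied from the STATEMENT/DETAIL lines above; adjust as needed):

-- Inspect the recorded states for the task run mentioned in the log above.
SELECT id, type, name, "timestamp", created
FROM task_run_state
WHERE task_run_id = '3649e07c-9883-42ac-8819-76c821f5f8eb'
ORDER BY "timestamp";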

Version info (prefect version output)

Version:             3.0.4
API version:         0.8.4
Python version:      3.12.6
Git commit:          c068d7e2
Built:               Tue, Oct 1, 2024 11:54 AM
OS/Arch:             linux/x86_64
Profile:             ephemeral
Server type:         server
Pydantic version:    2.9.2

Additional context

PostgreSQL version: 16

a14e added the bug (Something isn't working) label on Oct 8, 2024

a14e commented Oct 8, 2024

Here is an example of the error in the UI. You can see that tasks are not displayed, but there are saved artifacts. The Task Runs list is also empty.

Artifacts (screenshot: saved artifacts are present)

Task Runs (screenshot: the list is empty)


a14e commented Oct 8, 2024

Workaround:
Run the following against Prefect's PostgreSQL database:

DROP INDEX IF EXISTS uq_task_run_state__task_run_id_timestamp_desc;
CREATE INDEX uq_task_run_state__task_run_id_timestamp_desc
    ON task_run_state (task_run_id, timestamp DESC);
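Note that this removes the uniqueness guarantee rather than fixing the duplicate inserts, so it is best treated as a temporary measure. If you later want to restore the original constraint, a sketch (it assumes the original definition was a plain unique index on these two columns; verify against Prefect's migrations before applying):

-- Sketch only: assumes the original constraint was a unique index on
-- (task_run_id, timestamp DESC); check Prefect's migrations first.
DROP INDEX IF EXISTS uq_task_run_state__task_run_id_timestamp_desc;
CREATE UNIQUE INDEX uq_task_run_state__task_run_id_timestamp_desc
    ON task_run_state (task_run_id, timestamp DESC);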

desertaxle (Member) commented:

Thanks for the bug report @a14e! Do you have a flow that you can reliably reproduce this with? I suspect that there's something strange happening with the task run recorder, but it will be easier to confirm with a clear way to reproduce the issue.


a14e commented Oct 8, 2024

@desertaxle Thank you for your reply!
Yes, I can reproduce the issue, but there is a significant amount of randomness involved. Below, I've provided the flow code, and on my machine, it took around 420 flow runs to reproduce the issue twice (I was running it from the UI).

A simplified version of my flow looks like this:

from prefect import flow, task, serve
from typing import List


class DirectusLoaderTasks:

    @task(log_prints=True, name="Directus. Load Topic")
    async def get_topic_by_id(self, topic_id: str) -> str:
        return "123"

class GrammarDBTasks:

    @task(log_prints=True, name="DB. Load items for topic")
    async def load_all_items_from_db(self, topic_id: str) -> list[str]:
        return []

    @task(log_prints=True, name="DB. Load item groups for topic")
    async def load_all_groups_from_db(self, topic_id: str) -> list[str]:
        return []


@task(log_prints=True, name="Simple filter")
async def simple_filter(from_db: List[str]) -> List[str]:
    return []



class OpenAiTasks:
    @task(name="Open AI. filter3", retries=3)
    async def filter3(self,
                      groups: List[str],
                      topic: str) -> List[str]:
        return []

    @task(name="Open AI. filter1", retries=3)
    async def filter1(self,
                      groups: List[str],
                      topic: str) -> List[str]:

        return []

    @task(name="Open AI. filter2", retries=3)
    async def filter2(self,
                      groups: List[str],
                      groups_in_db: List[int],
                      topic: str) -> List[str]:

        return []

    @task(name="Open AI. Normalize", retries=3)
    async def normalize(self,
                        groups: List[str],
                        topic: str) -> List[str]:
        return []




@flow(log_prints=True, name="Item Generation flow")
async def generate_items_flow():

    directus_tasks = DirectusLoaderTasks()
    grammar_tasks = GrammarDBTasks()
    open_ai_tasks = OpenAiTasks()

    topic_id = await directus_tasks.get_topic_by_id("123")

    item_groups_from_db = await grammar_tasks.load_all_groups_from_db(topic_id)
    items_from_db = await grammar_tasks.load_all_items_from_db(topic_id)
    new_item_groups = await simple_filter(items_from_db)

    deduplicated_response: List[str] = await open_ai_tasks.filter1(
        new_item_groups,
        topic_id
    )

    filtered_response: List[str] = await open_ai_tasks.filter2(
        deduplicated_response,
        item_groups_from_db,
        topic_id
    )

    filtered_response = await open_ai_tasks.filter3(filtered_response,
                                                    topic_id)

    normalize_forms = await open_ai_tasks.normalize(filtered_response,
                                                    topic_id)

    return


if __name__ == "__main__":
    example_deploy = generate_items_flow.to_deployment(
        "Generate Grammar Item Groups",
        tags=["tag1", "tag2", "tag3"]
    )
    serve(example_deploy)
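Since it took several hundred runs from the UI to hit the issue, it may be quicker to trigger the runs from a script while the serve() process above is running. A minimal sketch, assuming the "<flow name>/<deployment name>" slug below matches how the deployment is registered (verify the exact name with prefect deployment ls):

import asyncio

from prefect.deployments import run_deployment


async def hammer(n: int = 500) -> None:
    # Fire off n runs of the deployment served above. timeout=0 returns
    # immediately with the flow run metadata instead of waiting for completion.
    for i in range(n):
        await run_deployment(
            name="Item Generation flow/Generate Grammar Item Groups",
            timeout=0,
        )
        print(f"submitted run {i + 1}/{n}")


if __name__ == "__main__":
    asyncio.run(hammer())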

Here’s a video showing what the PostgreSQL logs look like in the terminal:

Recording.2024-10-08.232607.mp4


rejner commented Nov 18, 2024

I have the exact same problem in a local setup with a PostgreSQL database and a Prefect server running in a simple docker-compose project. The occurrence rate is quite high; it happens even with a pipeline as simple as this:

from prefect import flow, task
import pandas as pd
import time
import os
import random


@task(name="Transform Data 1")
def transform_data_1(path) -> pd.DataFrame:
    print(path)
    data = {"id": [1, 2, 3], "value": [10, 20, 30]}
    time.sleep(random.randint(1, 10))
    return pd.DataFrame(data)


@task(name="Transform Data 2")
def transform_data_2(path) -> pd.DataFrame:
    print(path)
    data = {"id": [4, 5, 6], "value": [40, 50, 60]}
    time.sleep(random.randint(1, 10))
    return pd.DataFrame(data)


@flow
def process_data(file_path: str):
    df1 = transform_data_1(file_path)
    print("Done 1")
    df2 = transform_data_2(file_path)
    print("Done 2")


@flow(name="Example Pipeline")
def example_pipeline():
    for i in range(5):
        process_data(
            file_path=f"file_{i}.csv",
        )


if __name__ == "__main__":
    os.environ["PREFECT_API_URL"] = "http://localhost:4200"
    example_pipeline()

I have tried rewriting this with async calls, but it had no effect. I also tried both the postgres:16-alpine and postgres:17-alpine Docker images for PostgreSQL, as well as different Python image versions for the Prefect server. My Prefect version:

Version:             3.1.2
API version:         0.8.4
Python version:      3.10.15
Git commit:          02b99f0a
Built:               Tue, Nov 12, 2024 1:38 PM
OS/Arch:             linux/x86_64
Profile:             ephemeral
Server type:         ephemeral
Pydantic version:    2.9.2
Server:
  Database:          postgresql

The proposed workaround of modifying the constraint on the database side works, but the server container consumes an enormous amount of CPU and we need to restart it to break the loop.

desertaxle (Member) commented:

Thanks for the additional info @a14e and @rejner! I haven't been able to reproduce this on my end, but it looks like there's a race condition where a state change for a task run can be emitted twice with the exact same timestamp. I haven't yet tracked down where that's happening, but the issue is exacerbated because we keep retrying event processing when there's a failure, which causes this infinite loop.

I think I can address the infinite loop issue by adding a dead-letter queue to the task run recorder. The race condition will still exist, but it won't overwhelm the server CPU when it happens, and maybe we'll get some more info about the states that are causing this underlying issue.
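For context, the idea behind a dead-letter queue is to bound retries and park events that repeatedly fail so they can be inspected later, instead of retrying them forever. A generic sketch of the pattern (not Prefect's actual implementation; the class and max_attempts value are illustrative):

from typing import Any, Callable, List, Optional, Tuple


class DeadLetterQueue:
    """Bound retries and park events that repeatedly fail."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.dead_letters: List[Tuple[Any, Exception]] = []

    def process(self, event: Any, handler: Callable[[Any], None]) -> None:
        last_error: Optional[Exception] = None
        for _ in range(self.max_attempts):
            try:
                handler(event)
                return
            except Exception as exc:
                last_error = exc
        # Give up: record the event and its last error instead of looping on it.
        self.dead_letters.append((event, last_error))


def failing_handler(event: Any) -> None:
    # Simulates the duplicate-key failure seen in the logs above.
    raise ValueError("duplicate key value violates unique constraint")


dlq = DeadLetterQueue()
dlq.process({"task_run_id": "3649e07c-..."}, failing_handler)
print(len(dlq.dead_letters))  # -> 1, and the processor is free to move on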


rejner commented Nov 19, 2024

Thank you @desertaxle for such a quick response. Your proposed solution seems reasonable for now, and I am looking forward to the fix ;)
