Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(websocket): wait for task on destroy (IDFGH-14533) #751

Merged

Conversation

johanstokking
Copy link
Contributor

@johanstokking johanstokking commented Jan 28, 2025

Description

This ensures that the task is awaited on destroy.

This fixes a race condition between two independent actions:

  1. The WebSocket task detects a disconnect (e.g. read fails), initiates abort, sets run to false, posts the DISCONNECTED event, breaks, then posts the FINISHED events and deletes the task
  2. The application destroys the WebSocket client (e.g. via an asynchronous invocation via receiving the disconnected event)

Currently, in esp_websocket_client_destroy() would immediately destroy all resources, because run is already false. However, the task may still be running, and it needs the resources to cleanly stop the task (e.g. post the FINISHED event).

This change ensures that the task has stopped. The task must be stopped before resources can be freed.

The run check in the destroyer is also racy, so the wait shouldn't happen in an else.

Testing

I tested this locally.

Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

@johanstokking
Copy link
Contributor Author

@gabsuren thanks for considering this one too

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 28, 2025
@github-actions github-actions bot changed the title fix(websocket): wait for task on destroy fix(websocket): wait for task on destroy (IDFGH-14533) Jan 28, 2025
Copy link
Collaborator

@david-cermak david-cermak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for the fix!

So the real issue is actually using (client->run) to determine if the client is still running.
And we do the same thing in the ws_client_stop(), correct?
maybe checking status_bits for STOPPED_BIT would solve the issue, too, and we could even reduce some code duplication and do:

esp_err_t esp_websocket_client_stop(esp_websocket_client_handle_t client)
{
    return check_and_stop_internal(client);
}

esp_err_t esp_websocket_client_destroy(esp_websocket_client_handle_t client)
{
    esp_err_t err = check_and_stop_internal(client);
    // ignore ESP_FAIL for some "already stopped" case
    destroy_and_free_resources(client);
    return err;
}

just an idea on possible improvement, but your current changes in this PR LGTM!

Copy link
Collaborator

@david-cermak david-cermak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix unit tests per: https://github.com/espressif/esp-protocols/actions/runs/13029806393/job/36348129127

this change broke the basic init-destroy usecase (since STOPPED_BIT is not set upon initialization),
and also, destroy is called on any unsuccessful init (so need to take into account that the event group might not be initialized yet)

@johanstokking johanstokking force-pushed the fix/await-task-on-destroy branch from d252280 to 7fd85c6 Compare January 29, 2025 13:23
@johanstokking
Copy link
Contributor Author

Yes, checking for the STOPPED_BIT is definitely better, as run is for signaling and status_bits is state and that's what we want.

I followed your suggestion partially; the check is different for stop vs destroy. In stop, it's okay to log a warning; in destroy, it shouldn't warn if it wasn't started. So they now share the logic to stop and wait, and both check for the STOPPED_BIT. Indeed, the STOPPED_BIT should be set on initialization.

@johanstokking johanstokking force-pushed the fix/await-task-on-destroy branch from 7fd85c6 to bb87623 Compare January 29, 2025 13:28
@johanstokking johanstokking force-pushed the fix/await-task-on-destroy branch from bb87623 to 42674b4 Compare January 29, 2025 13:30
Copy link
Collaborator

@david-cermak david-cermak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

@david-cermak david-cermak merged commit b57979d into espressif:master Jan 29, 2025
55 checks passed
@johanstokking johanstokking deleted the fix/await-task-on-destroy branch January 29, 2025 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants