Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip NextBus update if integration is still loading #123564

Merged
merged 4 commits into from
Aug 18, 2024

Conversation

drozycki
Copy link
Contributor

@drozycki drozycki commented Aug 11, 2024

Fixes a race between the loading thread and
update thread leading to an unrecoverable error

Proposed change

The NextBus integration has a race condition at HA instance start between the thread which sequentially loads configured entities and the update thread which iterates through the set of loaded entities and makes NextBus API calls. The integration never recovers from this error, and will never update again until the HA instance is restarted. The likelihood of encountering this race condition increases as you increase the number of entities.

My fix checks the hass object state and returns an empty dict if it is not yet running. I have confirmed that this fixes the issue and that all automated tests pass.

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New integration (thank you!)
  • New feature (which adds functionality to an existing integration)
  • Deprecation (breaking change to happen in the future)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • I have followed the perfect PR recommendations
  • The code has been formatted using Ruff (ruff format homeassistant tests)
  • Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

If the code communicates with devices, web services, or third-party tools:

  • The manifest file has all fields filled out correctly.
    Updated and included derived files by running: python3 -m script.hassfest.
  • New or updated dependencies have been added to requirements_all.txt.
    Updated by running python3 -m script.gen_requirements_all.
  • For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.

To help with the load of incoming pull requests:

@home-assistant
Copy link

Hey there @ViViDboarder, mind taking a look at this pull request as it has been labeled with an integration (nextbus) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of nextbus can trigger bot actions by commenting:

  • @home-assistant close Closes the pull request.
  • @home-assistant rename Awesome new title Renames the pull request.
  • @home-assistant reopen Reopen the pull request.
  • @home-assistant unassign nextbus Removes the current integration label and assignees on the pull request, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

@gjohansson-ST gjohansson-ST added this to the 2024.8.2 milestone Aug 11, 2024
@@ -52,6 +52,10 @@ async def _async_update_data(self) -> dict[str, Any]:
"""Fetch data from NextBus."""
self.logger.debug("Updating data from API. Routes: %s", str(self._route_stops))

if self.hass.state is not CoreState.running:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems quite wierd that this integration would be dependent on HA startup (are we fixing a symptom?).

However it would be better to use from homeassistant.helpers.start import async_at_started in __init__.py to delay the startup of this integration rather than returning some empty dict here. See example in speedtestdotnet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to use this pattern instead. I verified that this resolves the issue on my production HA instance.

@home-assistant home-assistant bot marked this pull request as draft August 11, 2024 12:57
@home-assistant
Copy link

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@joostlek
Copy link
Member

I'd be interested to see the logs of this happening to fully understand why this is happening

@ViViDboarder
Copy link
Contributor

Same. As far as I can tell, this module isn't doing anything unusual with the way the coordinator works. Unless maybe there is a limit to the number of HTTP requests executed sequentially in an executor job, but that doesn't seem like waiting on startup would do anything to resolve it.

@drozycki
Copy link
Contributor Author

drozycki commented Aug 14, 2024

The log is in the linked issue #123562. The update thread iterates over the RouteStop set and throws when it detects that another thread has mutated the set during iteration. I don't understand the codebase well enough to be certain that set population happens during startup, but I didn't see a good way to wait until the component config has finished loading. Maybe there still is a race condition, but the added delay on the update thread is sufficient to avoid it on my system at my scale (10x RouteStop).

@joostlek
Copy link
Member

I think we should re-evaluate the logic within nextbus instead. The issue you are experiencing could also happen the moment someone reloads the integration. So I'd rather opt for a more robust solution that would remove this issue completely.

@ViViDboarder Why does every agency have its own coordinator as opposed to per entry?

@gjohansson-ST gjohansson-ST marked this pull request as draft August 14, 2024 18:57
@ViViDboarder
Copy link
Contributor

Ah. I see. So they share a coordinator because of the way the previous API was structured where you could fetch multiple predictions in a single call if they shared an agency.

That’s not relevant now.

However, I was recently informed that we can fetch multiple routes predictions if they share a stop (found via a bug report because when we fetch for a route it returned all for the stop).

Refactoring to share a coordinator per Agency-StopID would be the best approach, but will require another change to the client library.

All that said each coordinator should only run once though, so there should be no parallel mutations of the set of stops, unless refresh is being triggered after each stop is added, not after completing the entity initializations.

Should figure out if that is indeed the cause because a refactor to use stop based coordinators could still expose the same issue.

@ViViDboarder
Copy link
Contributor

Another possible patch would be to have the inline function that is being run in the executor and iterating over the stops access the stops via a parameter rather than from the coordinator directly using a shallow copy, that way if the set is mutated (a new route is added) while an update is happening, it will not run into any issue.

@joostlek
Copy link
Member

Yes you are refreshing after every integration addition

@joostlek
Copy link
Member

But if there isn't any limit, let's just do not do smart things

@drozycki
Copy link
Contributor Author

I updated this PR to use a local copy of _route_stops and avoid the race condition that way. Manual testing and the pytest suite all pass.

@gjohansson-ST
Copy link
Member

I updated this PR to use a local copy of _route_stops and avoid the race condition that way. Manual testing and the pytest suite all pass.

Much better to solve the issue. Some questions still outstanding but that can be handled in follow-up so if you're fine with this PR don't forget to click "Ready for review" 👍

@frenck frenck removed this from the 2024.8.2 milestone Aug 16, 2024
@IamTheFij
Copy link

Were you able to verify this solves the problem? I'm pretty sure it should, but once you've confirmed click "Ready for Review" as @gjohansson-ST said.

@drozycki drozycki marked this pull request as ready for review August 18, 2024 11:06
@joostlek joostlek added this to the 2024.8.3 milestone Aug 18, 2024
@joostlek joostlek merged commit 04b0760 into home-assistant:dev Aug 18, 2024
26 checks passed
@drozycki drozycki deleted the race branch August 19, 2024 02:13
@github-actions github-actions bot locked and limited conversation to collaborators Aug 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NextBus] Integration crashes after startup when several entities configured
7 participants