
Appropriate retry mechanism when Zaptec cloud is not working #90

Closed · Tracked by #130 · Fixed by #111
sveinse opened this issue Jan 14, 2024 · 17 comments

Labels: enhancement (New feature or request)

Comments

@sveinse (Collaborator) commented Jan 14, 2024

Discussion: What is the appropriate action when the Zaptec cloud is not responding or returns an error?

Currently the Zaptec integration logs these failures to the HA system log, which can easily be interpreted as the Zaptec integration itself being at fault. Typical errors observed are timeouts and HTTP error code 500 (internal server error). When this happens, a large number of errors and warnings are logged.

The Home Assistant integration Quality Scale states that an integration should (for Silver):

  • Handles internet unavailable. Log a warning once when unavailable, log once when reconnected.
  • Handles device/service unavailable. Log a warning once when unavailable, log once when reconnected.

So what should we do? Do we need some way to determine the health of the Zaptec cloud so that our integration can reduce the error logging?
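
For illustration, the log-once pattern could be as simple as the sketch below (the class and method names are invented for this example, not the integration's actual code):

```python
import logging

_LOGGER = logging.getLogger(__name__)


class CloudHealthTracker:
    """Tracks Zaptec cloud availability and logs state changes only once."""

    def __init__(self) -> None:
        self._available = True

    def report_success(self) -> None:
        """Call after a successful API request."""
        if not self._available:
            _LOGGER.info("Connection to Zaptec cloud re-established")
            self._available = True

    def report_failure(self, err: Exception) -> None:
        """Call after a failed API request; logs only on the first failure."""
        if self._available:
            _LOGGER.warning("Zaptec cloud is unavailable: %s", err)
            self._available = False
```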

@svenakela (Contributor) commented:

This could be solved by setting an attribute on the binary_sensor when an API call fails, and then logging the problem. When the API is working again, the timestamp can be cleared and a reconnect is logged. That would also make it easy to visualize when the API is down.

I've had some issues the last two nights with charging stopping several times, and I suspect the API. An error timestamp would have made that easy to detect.
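
A rough sketch of that idea on a Home Assistant binary sensor (the entity, attribute, and method names are invented for illustration; they are not the integration's actual entities):

```python
from datetime import datetime, timezone

from homeassistant.components.binary_sensor import (
    BinarySensorDeviceClass,
    BinarySensorEntity,
)


class ZaptecCloudSensor(BinarySensorEntity):
    """Hypothetical sensor reflecting whether the Zaptec API is reachable."""

    _attr_name = "Zaptec cloud connection"
    _attr_device_class = BinarySensorDeviceClass.CONNECTIVITY

    def __init__(self) -> None:
        self._attr_is_on = True          # True = API reachable
        self._last_api_error: datetime | None = None

    def record_api_failure(self) -> None:
        """Called when an API call fails: store the timestamp, mark as down."""
        self._last_api_error = datetime.now(timezone.utc)
        self._attr_is_on = False

    def record_api_success(self) -> None:
        """Called when the API works again: clear the timestamp."""
        self._last_api_error = None
        self._attr_is_on = True

    @property
    def extra_state_attributes(self) -> dict:
        return {"last_api_error": self._last_api_error}
```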

@Bluhme1 commented Jan 31, 2024

Agreed. The Zaptec API has been unstable during the last few days (error 500), which has produced a lot of errors in the log.

@sveinse (Collaborator, Author) commented Jan 31, 2024

Interestingly, entities for this already exist: the "* Charger", "* Circuit" and "* Installation" entities. They were added to provide a place to store the legacy attributes and currently only report "Connected". It would be a very useful extension for them to also show the status of the API connection. In hindsight it seems obvious to have this feature.

Definitely something to put on the list! Thanks.

sveinse added the "enhancement" label on Jan 31, 2024
@Bluhme1 commented Feb 23, 2024

Still getting timeouts like:

This error originates from a custom integration.

Logger: custom_components.zaptec.api
Source: custom_components/zaptec/api.py:855
Integration: Zaptec EV charger (documentation)
First occurred: 16.52.35 (1 occurrence)
Last logged: 16.52.35

Request to https://api.zaptec.com/api/installation/bf68a41d-d2f9-44f5-bfc6-20bf42e645a8 failed (attempt 1): TimeoutError:

Perhaps the Zaptec integration should be more patient before reporting an error. After a short while the connection is re-established.

@Bluhme1 commented May 5, 2024

Hi @sveinse

Did you get any further in your considerations on how to solve this issue? It's a pity that such a useful integration produces so many errors.

@JTT22 commented Jul 29, 2024

How do we get someone working on this? My logs are unusable...

@Hellowlol (Collaborator) commented:

Do you have API debug on? Anyway, I think the proper way here is to retry using a library.
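
For example, with the tenacity library that could look roughly like this (only an illustrative sketch; `_request` is a placeholder for the actual API call, and the integration may well choose a different library):

```python
import asyncio

import aiohttp
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type((aiohttp.ClientError, asyncio.TimeoutError)),
    wait=wait_exponential(multiplier=1, max=30),  # 1 s, 2 s, 4 s, ... capped at 30 s
    stop=stop_after_attempt(5),
    reraise=True,
)
async def _request(session: aiohttp.ClientSession, url: str) -> dict:
    """Placeholder for the actual API call; retried on timeouts and HTTP errors."""
    async with session.get(url) as resp:
        resp.raise_for_status()  # raises ClientResponseError on e.g. 500
        return await resp.json()
```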

@JTT22 commented Jul 30, 2024

I mostly get timeout errors, or the error below, which is flooding the log:

custom_components.zaptec.api.RequestError: GET request to https://api.zaptec.com/api/circuits/KEYREMOVED failed with status 500: <ClientResponse(https://api.zaptec.com/api/circuits/KEYREMOVED) [500 Internal Server Error]>
<CIMultiDictProxy('Date': 'Tue, 30 Jul 2024 08:32:55 GMT', 'Content-Length': '0', 'Connection': 'keep-alive', 'Request-Context': 'appId=cid-KEYREMOVED', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains')>

@sveinse (Collaborator, Author) commented Jul 30, 2024

I'm sorry, I haven't had time to look at this.

Currently the design logs every time an API request to the Zaptec cloud fails, so when the connection is unstable this results in many log entries. I agree that this should change. The HA quality scale requires for Silver that the integration "Handles device/service unavailable. Log a warning once when unavailable, log once when reconnected."

@sveinse (Collaborator, Author) commented Aug 7, 2024

@Bluhme1, @JTT22 Thank you for your patience. I've merged a fix into master that should reduce the logging to a minimum. Can you please test the "master" version and see if it fixes the problem for you?

@Hellowlol (Collaborator) commented:

The best way to deal with this is to add a retry on 500 errors, so you suppress the exception and just retry later. Can someone post the entire blueprint?

sveinse added a commit to sveinse/zaptec that referenced this issue Aug 8, 2024
sveinse added a commit to sveinse/zaptec that referenced this issue Aug 8, 2024
@sveinse (Collaborator, Author) commented Aug 8, 2024

@Hellowlol In my experience, when the Zaptec cloud starts returning 500, it tends to do so for quite a while. Nevertheless, I created #111, which will retry up to 5 times before giving up. What do you think? If this makes sense I can merge it to master and have it tested by the users who are most prone to this problem.

The main objective of this issue was to control the extensive logging the API was producing when the Zaptec cloud fails; we can't really do anything about the availability of the Zaptec cloud itself. With the fix in #110, the API will no longer do its own logging, but instead leave it to the HA UpdateCoordinator to handle available/not-available messages.
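
For reference, the division of responsibility being described is roughly the sketch below: the API layer raises, and the DataUpdateCoordinator takes care of the once-only unavailable/recovered logging. All names other than HA's own classes are hypothetical here, not the integration's actual code:

```python
from homeassistant.helpers.update_coordinator import DataUpdateCoordinator, UpdateFailed


class ZaptecApiError(Exception):
    """Placeholder for whatever exception type the API layer raises."""


class ZaptecCoordinator(DataUpdateCoordinator):
    """Hypothetical coordinator: the API layer raises, HA handles the logging."""

    async def _async_update_data(self):
        try:
            # self.api.poll_all() is a placeholder for the full Zaptec poll.
            return await self.api.poll_all()
        except ZaptecApiError as err:
            # Raising UpdateFailed lets the coordinator mark entities unavailable,
            # log a warning once, and log once more when an update succeeds again.
            raise UpdateFailed(f"Zaptec cloud request failed: {err}") from err
```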

@svenakela (Contributor) commented:

A nice thing to do on retries is an exponential delay until the next retry. Otherwise all clients will hammer the API exactly when it can least handle a flood of requests. It can be done with the backoff decorator, for example, or in plain code.
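
With the backoff package, that could look roughly like this (`_do_request` is a placeholder name and the thresholds are only examples, not a recommendation for this integration):

```python
import aiohttp
import backoff


@backoff.on_exception(
    backoff.expo,  # 1 s, 2 s, 4 s, 8 s, ... between attempts
    (aiohttp.ClientResponseError, aiohttp.ClientConnectionError),
    max_tries=5,
    max_time=60,   # never keep retrying past one update interval
)
async def _do_request(session: aiohttp.ClientSession, url: str) -> dict:
    """Placeholder request helper; retried with exponentially growing delays."""
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.json()
```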

@sveinse (Collaborator, Author) commented Aug 8, 2024

Yes, that is a good idea. However, there are limits to how long these backoffs can be. A full poll from the HA Zaptec integration to the Zaptec cloud requires quite a few requests, and by default this happens every 60 seconds. Any backoff that introduces delays of several minutes will probably interfere with HA's update interval, and I'm not sure how the HA UpdateCoordinator responds to that.
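
A hand-rolled variant that keeps the worst-case delay inside a 60-second poll cycle could look like this sketch (all names and numbers are illustrative assumptions, not the integration's actual values):

```python
import asyncio

import aiohttp

MAX_ATTEMPTS = 5
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 10.0   # cap each delay so the retries fit inside a 60 s poll cycle


async def request_with_backoff(session: aiohttp.ClientSession, url: str) -> dict:
    """Hypothetical helper: capped exponential backoff for a single API request."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.json()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == MAX_ATTEMPTS - 1:
                raise
            # Delays of 1, 2, 4 and 8 s between the five attempts:
            # at most ~15 s of waiting, well inside the 60 s update interval.
            await asyncio.sleep(min(BASE_DELAY * 2**attempt, MAX_DELAY))
```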

sveinse added a commit to sveinse/zaptec that referenced this issue Oct 18, 2024
@sveinse (Collaborator, Author) commented Oct 18, 2024

The PR has now been updated with an exponential backoff mechanism.

sveinse changed the title from "Appropriate action when Zaptec cloud is not working" to "Appropriate retry mechanism when Zaptec cloud is not working" on Oct 20, 2024
sveinse mentioned this issue on Oct 20, 2024
sveinse added a commit that referenced this issue Oct 20, 2024
* Retry on API error code 500 (#90)
* Implement exponential backoff on requests
* Fix typing bug in set_authentication_required()
* Clean up code
@sveinse (Collaborator, Author) commented Oct 22, 2024

A fix for the issue has been pushed to master. Please install the "master" version and test whether it fixes the observed problems. @svenakela @Bluhme1 @JTT22

@sveinse (Collaborator, Author) commented Oct 22, 2024

It seems HACS has removed the option of downloading "master", so I've created a beta pre-release, "0.7.2b1", that is available for download and testing.
