Synchronize push monitor heartbeats to api calls (fixes #1422) #1428
Please use draft PR |
Ok so this is updated and tested. There's still a remaining issue caused by the database not storing the datetimes with milliseconds. This is an issue with the Once that's fixed, I'll retest this PR |
Without milliseconds in the format, the time information in the database can't be used to determine intervals <1 second. For example, if you have one event at 12:15:24.999 and another at 12:15:45.001, the database will record them as 12:15:24 and 12:15:45, showing a 21s time difference even though it's much closer to 20s (20.002). This means that even if the api call happens within one heartbeat interval, you'll get an incorrect down notification whenever the seconds value in the stored timestamps makes the gap look longer than it was. |
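To make the rounding effect concrete, here is a minimal sketch of the same arithmetic. The helper `truncateToSeconds` and the calendar date are illustrative assumptions, not code from this PR or from the database layer; the sketch only mimics what a seconds-only datetime column does to the two timestamps from the example.

```javascript
// Hypothetical helper: drop the millisecond part of an epoch timestamp,
// mimicking a datetime column that stores whole seconds only.
function truncateToSeconds(epochMs) {
    return Math.floor(epochMs / 1000) * 1000;
}

// The two events from the example (the date itself is arbitrary).
const firstEvent = Date.UTC(2022, 0, 1, 12, 15, 24, 999); // 12:15:24.999
const secondEvent = Date.UTC(2022, 0, 1, 12, 15, 45, 1); // 12:15:45.001

// Real gap between the events, in milliseconds.
const actualGapMs = secondEvent - firstEvent;

// Gap as seen through seconds-only storage.
const storedGapMs = truncateToSeconds(secondEvent) - truncateToSeconds(firstEvent);

console.log(actualGapMs); // 20002
console.log(storedGapMs); // 21000
```

The stored gap overstates the real one by almost a full second, which is enough to push a heartbeat that arrived on time past the interval.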
Ok looks like with louislam/redbean-node#7 implemented, the last remaining problem is gone. I did run into a weird issue when I tried to symlink my local Once the upstream is released, I'll make sure that things still work when it pulls the package from npm. |
@louislam this is ready to go! FYI - I did notice that saving a push monitor causes it to resume, which will also cause a race condition with the api call. That's not addressed by this PR, but it only happens on user interaction, and even then it's extremely rare. I don't think it's really worth addressing. Also, in case anyone was wondering, the issue I was seeing with timezones/dependencies was caused by a mismatch in the |
Ok cleaned it up and ran the unit tests |
Logic seems sound and tested the 1 second buffer time works okay. Just need to run eslint fixes. |
@chakflying fixed linting issues in monitor.js |
rebased. @louislam this is ready to be merged |
You need the rest of it because otherwise heartbeats aren't stored in the database with ms precision, which will cause false DOWN's due to rounding errors. I don't think your suggestion will actually work. At first glance you're going to run into some edge cases that will break things, but tbh I don't have time to test it. And while I appreciate that you've simplified the math, I wrote it the way I did specifically to be easy to read and understand, not necessarily to be simple or have fewer lines. IMO the PR still reads more easily in terms of understanding. |
Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.
I added a grace period of 500ms In your PR the calls to |
That's not correct. See this old comment for the basic explanation. It does matter to eliminate the race condition because of how the JS event loop and
I'm skeptical. You probably just didn't run it long enough for the race condition to happen. The value of the grace period is somewhat arbitrary. With the on-LAN cron job I was using, I didn't see any latencies >250ms, but I figured 1000ms gives a large safety margin for the worst conceivable situations (e.g. very resource-constrained service over WAN) with an insignificant hit to the notification delay.
Fixed, thanks. The logging call was totally different when I opened the PR 2 months ago. This is what happens when you have to keep rebasing... |
A grace period of 1000ms should cover for that
Ideally you would set the heartbeat interval to 21s instead of 20s to cover for cases like that.
Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021. |
@quthla The new logging function was merged recently, it is acceptable. I will try to fully focus on this pr this weekend. |
It doesn't quite. If you don't have ms precision, you can get a ping at
You could already do that without this PR. I've been running that as a workaround since I reported the original issue. I think a common scenario is a 60s interval with cronjob ping, so it seems wise to support that without requiring users to fudge the interval to handle normal or even large latency (not to mention the JS event loop, which no one should have to worry about)
Yes. You can use the |
And yet that PR only changed the call from |
@quthla Usually you have to check the PR merge date instead of the commit date. Let's say a commit was made in 2015; a pull request containing it could still be created and merged in 2022. |
@louislam you're right. I was confused because it usually doesn't take months for rather simple PRs to be merged and I didn't check if the commit was part of a PR. |
@quthla for the record - this is what the file looked like when I wrote and submitted the PR uptime-kuma/server/model/monitor.js Lines 309 to 329 in be88351
# Conflicts: # server/model/monitor.js
Ready to go, thanks! |
Description
Fixes #1422.
The goal is to avoid the race condition caused by creating heartbeats and checking them in two different, asynchronous functions, which is exacerbated by the possible drift of heartbeat checking due to `setTimeout` behavior.
The basic idea is that before we do a `setTimeout(beat)`, we check when the last heartbeat arrived, and adjust the timeout such that the next check happens 1 heartbeat interval after the last heartbeat. This prevents the heartbeat check from drifting relative to the api call (which usually happens on a fixed interval like a cron job).
A 1s buffer time is added to that timeout to ensure that the heartbeat check happens after the api call is expected. This helps avoid the race condition that occurs when the heartbeat check and api call happen simultaneously (which they will tend to do). The buffer time may need to be increased, but the tradeoff is a delay in when the DOWN notification is sent in the event of a missed heartbeat.
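The timing adjustment described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the function name `nextCheckDelay`, its parameters, and the fallback used when no heartbeat has been recorded yet are all assumptions made for the example.

```javascript
// Hypothetical helper: compute how long to wait before the next heartbeat check.
// lastHeartbeatMs: epoch ms of the most recent heartbeat, or null if none exists yet.
// intervalSec:     the monitor's heartbeat interval, in seconds.
// nowMs:           current time as epoch ms.
// bufferMs:        extra wait so the check lands after the expected api call.
function nextCheckDelay(lastHeartbeatMs, intervalSec, nowMs, bufferMs = 1000) {
    const intervalMs = intervalSec * 1000;

    // Assumption: with no heartbeat on record, fall back to the plain interval.
    if (lastHeartbeatMs === null) {
        return intervalMs;
    }

    // Aim for one interval after the last heartbeat, plus the buffer.
    const dueAtMs = lastHeartbeatMs + intervalMs + bufferMs;

    // Never return a negative delay if the due time has already passed.
    return Math.max(0, dueAtMs - nowMs);
}

// Usage: the last heartbeat arrived 200 ms ago on a 60 s interval.
const delayMs = nextCheckDelay(100000, 60, 100200);
console.log(delayMs); // 60800
```

The returned value would then be passed to `setTimeout(beat, delayMs)`, so each check is anchored to the last heartbeat rather than to the previous check.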
It may be desirable to make the buffer time user-controlled.
Right now, retries aren't explicitly addressed. I would propose that in the case of missed heartbeats, the monitor reverts to checking exactly every heartbeat interval (i.e. ignoring the time of the last heartbeat).
This seemed to me to be the best way to address the issue without making huge architectural changes (e.g. making the heartbeat checking happen in the router function).
Please let me know if this all looks ok and I can finish it up, and finish testing, linting/formatting, etc.