Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

kaysond · 2022-03-28T21:39:57Z

Description

The goal is to avoid the race condition caused by creating heartbeats and checking them in two different, asynchronous functions, which is exacerbated by the possible drift of heartbeat checking due to setTimeout behavior.

The basic idea is that before we do a setTimeout(beat), we check when the last heartbeat arrived, and adjust the timeout such that the next check happens 1 heartbeat interval after the last heartbeat. This prevents the heartbeat check from drifting relative to the api call (which usually happens on a fixed interval like a cron job).

A 1s buffer time is added to that timeout to ensure that the heartbeat check happens after the api call is expected. This helps avoid the race condition that occurs when the heartbeat check and api call happen simultaneously (which they will tend to do). The buffer time may need to be increased, but the tradeoff is a delay in when the DOWN notification is sent in the event of a missed heartbeat.

It may be desirable to make the buffer time user-controlled.

Right now, retries aren't explicitly addressed. I would propose that in the case of missed heartbeats, the monitor reverts to checking exactly every heartbeat interval (i.e. ignoring the time of the last heartbeat).

This seemed to me to be the best way to address the issue without making huge architectural changes (i.e. make the heartbeat checking happen in the router function).

Please let me know if this all looks ok and I can finish it up, and finish testing, linting/formatting, etc.

Type of change

Bug fix (non-breaking change which fixes an issue)

Checklist

My code follows the style guidelines of this project
I ran ESLint and other linters for modified files
I have performed a self-review of my own code and tested it
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
My code needed automated testing. I have added them (this is optional task)

server/model/monitor.js

Saibamen · 2022-03-30T21:58:53Z

Please use draft PR
https://github.blog/2019-02-14-introducing-draft-pull-requests/

kaysond · 2022-04-04T03:26:42Z

Ok so this is updated and tested. There's still a remaining issue caused by the database not storing the datetimes with milliseconds. This is an issue with the redbean-node library (see louislam/redbean-node#6).

Once that's fixed, I'll retest this PR

kaysond · 2022-04-04T03:33:06Z

Without milliseconds in the format, the time information in the database can't be used to determine intervals <1 second. For example, if you have one event at 12:15:24.999 and another at 12:15:45.001, the database will record them as 12:15:24 and 12:15:45, showing a 21s time difference even though its much closer to 20s (20.002).

This means that even if the api call happens within one heartbeat interval, but it the seconds in the timestamp is different, you'll get an incorrect down notification.

server/model/monitor.js

kaysond · 2022-04-08T03:54:36Z

Ok looks like with louislam/redbean-node#7 implemented, the last remaining problem is gone.

I did run into a weird issue when I tried to symlink my local redbean-node repo into node_modules where timezones weren't working properly, but when I manually copied the updated redbean-node/dist folder into uptime-kuma/node_modules/redbean-node, the issue went away.

Once the upstream is released, I'll make sure that things still work when it pulls the package from npm.

kaysond · 2022-04-12T22:31:39Z

@louislam this is ready to go!

FYI - I did notice that saving a push monitor causes it to resume, which will also cause a race condition with the api call. That's not addressed by this PR, but it only happens on user interaction, and even then its extremely rare. I don't think its really worth addressing.

Also, in case anyone was wondering, the issue I was seeing with timezones/dependencies was caused by a mismatch in the dayjs version used in uptime-kuma and redbean-node. For some reason the different versions caused a difference in the dayjs prototype between the monitor code and the dependency, so the below instanceof operator was returning false, even though by design it should've been true. This meant the database stored a non-UTC datetime

https://github.com/louislam/redbean-node/blob/3b1faa62a30a8141ac3fc0bbbe348ae51464941c/lib/redbean-node.ts#L997-L1004

kaysond · 2022-04-12T22:38:21Z

Ok cleaned it up and ran the unit tests

chakflying · 2022-04-13T04:49:16Z

Logic seems sound and tested the 1 second buffer time works okay. Just need to run eslint fixes.

kaysond · 2022-04-13T08:13:53Z

@chakflying fixed linting issues in monitor.js
The rest of the eslint errors are outside this PR

kaysond · 2022-04-19T04:28:36Z

rebased. @louislam this is ready to be merged

server/model/monitor.js

kaysond · 2022-05-26T19:30:38Z

The rest of the original PR is not needed.

You need the rest of it because otherwise heartbeats aren't stored in the database with ms precision, which will cause false DOWN's due to rounding errors.

I don't think your suggestion will actually work. At first glance you're going to run into some edge cases that will break things, but tbh I don't have time to test it. And while I appreciate that you've simplified the math, I wrote it the way I did specifically to be easy to read and understand, not necessarily to be simple or have fewer lines. IMO the PR still reads more easily in terms of understanding.

quthla · 2022-05-26T19:43:40Z

The rest of the original PR is not needed.

You need the rest of it because otherwise heartbeats aren't stored in the database with ms precision, which will cause false DOWN's due to rounding errors.

Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.

I don't think your suggestion will actually work. At first glance you're going to run into some edge cases that will break things, but tbh I don't have time to test it. And while I appreciate that you've simplified the math, I wrote it the way I did specifically to be easy to read and understand, not necessarily to be simple or have fewer lines. IMO the PR still reads more easily in terms of understanding.

I added a grace period of 500ms diff + 500 for safety reasons. I ran it initially without the 500ms and it still worked in all cases.

In your PR the calls to log.debug also lack the first parameter and will not work.

kaysond · 2022-05-27T04:47:01Z

Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.

That's not correct. See this old comment for the basic explanation. It does matter to eliminate the race condition because of how the JS event loop and setTimeout works, as I've mentioned in other comments.

I added a grace period of 500ms diff + 500 for safety reasons. I ran it initially without the 500ms and it still worked in all cases.

I'm skeptical. You probably just didn't run it long enough for the race condition to happen. The value of the grace period is somewhat arbitrary. With the on-LAN cron job I was using, I didn't see any latencies >250ms, but I figured 1000ms gives a large safety margin for the worst conceivable situations (e.g. very resource-constrained service over WAN) with an insignificant hit to the notification delay.

In your PR the calls to log.debug also lack the first parameter and will not work.

Fixed, thanks. The logging call was totally different when I opened the PR 2 months ago. This is what happens when you have to keep rebasing...

quthla · 2022-05-27T09:59:20Z

Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.

That's not correct. See this old comment for the basic explanation. It does matter to eliminate the race condition because of how the JS event loop and setTimeout works, as I've mentioned in other comments.

A grace period of 1000ms should cover for that

I added a grace period of 500ms diff + 500 for safety reasons. I ran it initially without the 500ms and it still worked in all cases.

I'm skeptical. You probably just didn't run it long enough for the race condition to happen. The value of the grace period is somewhat arbitrary. With the on-LAN cron job I was using, I didn't see any latencies >250ms, but I figured 1000ms gives a large safety margin for the worst conceivable situations (e.g. very resource-constrained service over WAN) with an insignificant hit to the notification delay.

Ideally you would set the heartbeat interval to 21s instead of 20s to cover for cases like that.

In your PR the calls to log.debug also lack the first parameter and will not work.

Fixed, thanks. The logging call was totally different when I opened the PR 2 months ago. This is what happens when you have to keep rebasing...

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

louislam · 2022-05-27T12:09:14Z

@quthla The new logging function was merged recently, it is acceptable.

I will try to fully focus on this pr this weekend.

kaysond · 2022-05-27T17:00:49Z

A grace period of 1000ms should cover for that

It doesn't quite. If you don't have ms precision, you can get a ping at 10:10:10.999, but the db stores it at 10:10:10.000 (it doesn't round), so the monitor checks for the next ping at 10:10:31.000 (assuming 20s interval + 1s buffer). The next ping will also come at 10:10:30.999, so you end up back at the race condition. Obviously you can fix this with a >1s buffer, but that's more like addressing the symptom instead of the cause. You also need ms precision for sub-1s buffers to be meaningful, and I think long term it makes sense to have the buffer be user-defined with a lower default.

Ideally you would set the heartbeat interval to 21s instead of 20s to cover for cases like that.

You could already do that without this PR. I've been running that as a workaround since I reported the original issue. I think a common scenario is a 60s interval with cronjob ping, so it seems wise to support that without requiring users to fudge the interval to handle normal or even large latency (not to mention the JS event loop, which no one should have to worry about)

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

Yes. You can use the git blame to see which commit changed it. Looks like 4/13.

quthla · 2022-05-27T17:15:48Z

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

Yes. You can use the git blame to see which commit changed it. Looks like 4/13.

And yet that PR only changed the call from log_info to log.info but the logging function always had 2 parameters since being introduced back in 2021. Anyways, glad you fixed it and looking forward to your PR being merged.

bdcdf47

louislam · 2022-05-27T17:48:24Z

@quthla Usually you have to check the pr merged date instead of commit date.

Let say we made a commit in 2015, and we can create a pull request and get merged in 2022.

quthla · 2022-05-27T17:54:51Z

@louislam you're right. I was confused because it usually doesn't take months for rather simple PRs to be merged and I didn't check if the commit was part of a PR.

kaysond · 2022-05-27T17:56:34Z

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

Yes. You can use the git blame to see which commit changed it. Looks like 4/13.

And yet that PR only changed the call from log_info to log.info but the logging function always had 2 parameters since being introduced back in 2021. Anyways, glad you fixed it and looking forward to your PR being merged.

bdcdf47

@quthla for the record - this is what the file looked like when I wrote and submitted the PR

uptime-kuma/server/model/monitor.js

Lines 309 to 329 in be88351

    
           } else if (this.type === "push") {      // Type: Push 
        
               const time = R.isoDateTime(dayjs.utc().subtract(this.interval, "second")); 
        
               let heartbeatCount = await R.count("heartbeat", " monitor_id = ? AND time > ? ", [ 
        
                   this.id, 
        
                   time 
        
               ]); 
        
               debug("heartbeatCount" + heartbeatCount + " " + time); 
        
               if (heartbeatCount <= 0) { 
        
                   // Fix #922, since previous heartbeat could be inserted by api, it should get from database 
        
                   previousBeat = await Monitor.getPreviousHeartbeat(this.id); 
        
                   throw new Error("No heartbeat in the time window"); 
        
               } else { 
        
                   // No need to insert successful heartbeat for push type, so end here 
        
                   retries = 0; 
        
                   this.heartbeatInterval = setTimeout(beat, beatInterval * 1000); 
        
                   return; 
        
               }

# Conflicts: # server/model/monitor.js

louislam · 2022-05-28T16:44:45Z

I understand this pr now. Just found a small issue.

If I don't call the push url for the first tick, the error msg is undefined.

kaysond · 2022-05-29T02:59:52Z

I understand this pr now. Just found a small issue.

If I don't call the push url for the first tick, the error msg is undefined.

Fixed!

louislam · 2022-05-29T04:34:00Z

Ready to go, thanks!

Saibamen reviewed Mar 30, 2022

View reviewed changes

server/model/monitor.js Outdated Show resolved Hide resolved

server/model/monitor.js Outdated Show resolved Hide resolved

kaysond marked this pull request as draft March 30, 2022 23:46

kaysond force-pushed the master branch 3 times, most recently from 4dc16c7 to 85a5d9c Compare April 4, 2022 02:27

kaysond mentioned this pull request Apr 4, 2022

isoDateTime() and isoTime() don't use milliseconds which causes issues for uptime-kuma push notifications louislam/redbean-node#6

Closed

Saibamen reviewed Apr 4, 2022

View reviewed changes

server/model/monitor.js Outdated Show resolved Hide resolved

kaysond mentioned this pull request Apr 7, 2022

Add isoDateTimeMillis and isoTimeMillis louislam/redbean-node#7

Merged

kaysond force-pushed the master branch from cf27eab to 17c991f Compare April 8, 2022 03:56

kaysond mentioned this pull request Apr 12, 2022

Race condition in push monitors #1422

Closed

2 tasks

kaysond force-pushed the master branch from 17c991f to 0ea1f65 Compare April 12, 2022 22:25

kaysond marked this pull request as ready for review April 12, 2022 22:25

kaysond force-pushed the master branch 2 times, most recently from f240fcc to 08b1122 Compare April 12, 2022 22:37

chakflying approved these changes Apr 13, 2022

View reviewed changes

kaysond force-pushed the master branch 4 times, most recently from 030d19e to 780449a Compare April 18, 2022 20:32

kaysond changed the title ~~[Draft] Synchronize push monitor heartbeats to api calls (fixes #1422)~~ Synchronize push monitor heartbeats to api calls (fixes #1422) Apr 18, 2022

wishdev reviewed Apr 26, 2022

View reviewed changes

server/model/monitor.js Show resolved Hide resolved

Fix log.debug calls

ab4edf2

louislam added this to the 1.17.0 milestone May 27, 2022

louislam added 2 commits May 28, 2022 23:19

Merge remote-tracking branch 'origin/master' into kaysond_master

df4ce81

# Conflicts: # server/model/monitor.js

Fix another log.debug call

7f46223

louislam modified the milestones: 1.17.0, 1.16.1 May 28, 2022

show correct down message for first tick

ea10d89

louislam merged commit 90fe25e into louislam:master May 29, 2022

louislam mentioned this pull request Jun 10, 2022

[push] reporting down when servcice is up in same second. #1747

Open

2 tasks

kaysond mentioned this pull request Jun 14, 2022

Fix upside down push monitors #1762

Merged

6 tasks

snyk-bot mentioned this pull request Jan 27, 2023

[Snyk] Upgrade nodemailer from 6.6.5 to 6.8.0 IMS-Network/uptime-kuma#3

Merged

thefourcraft mentioned this pull request Feb 23, 2023

[Snyk] Upgrade nodemailer from 6.6.5 to 6.9.1 IMS-Network/uptime-kuma#31

Closed

snyk-bot mentioned this pull request Apr 1, 2023

[Snyk] Upgrade nodemailer from 6.6.5 to 6.9.1 othmaneelmhassani/uptime-kuma#2

Open

chakflying mentioned this pull request May 21, 2023

Some questions about the project and vision #3180

Closed

2 tasks

chakflying mentioned this pull request Jul 30, 2023

Monitor stucks at PENDING #1355

Closed

2 tasks

chakflying mentioned this pull request Aug 18, 2023

Understanding Push Monitor Type Timing #3599

Closed

2 tasks

CommanderStorm mentioned this pull request Jan 3, 2024

allow push monitors to update prometheus #4318

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

kaysond commented Mar 28, 2022 •

edited

Loading

Saibamen commented Mar 30, 2022

kaysond commented Apr 4, 2022

kaysond commented Apr 4, 2022 •

edited

Loading

kaysond commented Apr 8, 2022

kaysond commented Apr 12, 2022

kaysond commented Apr 12, 2022

chakflying commented Apr 13, 2022

kaysond commented Apr 13, 2022

kaysond commented Apr 19, 2022

kaysond commented May 26, 2022

quthla commented May 26, 2022

kaysond commented May 27, 2022

quthla commented May 27, 2022 •

edited

Loading

louislam commented May 27, 2022

kaysond commented May 27, 2022

quthla commented May 27, 2022

louislam commented May 27, 2022

quthla commented May 27, 2022

kaysond commented May 27, 2022

louislam commented May 28, 2022 •

edited

Loading

kaysond commented May 29, 2022

louislam commented May 29, 2022

Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

Conversation

kaysond commented Mar 28, 2022 • edited Loading

Description

Type of change

Checklist

Saibamen commented Mar 30, 2022

kaysond commented Apr 4, 2022

kaysond commented Apr 4, 2022 • edited Loading

kaysond commented Apr 8, 2022

kaysond commented Apr 12, 2022

kaysond commented Apr 12, 2022

chakflying commented Apr 13, 2022

kaysond commented Apr 13, 2022

kaysond commented Apr 19, 2022

kaysond commented May 26, 2022

quthla commented May 26, 2022

kaysond commented May 27, 2022

quthla commented May 27, 2022 • edited Loading

louislam commented May 27, 2022

kaysond commented May 27, 2022

quthla commented May 27, 2022

louislam commented May 27, 2022

quthla commented May 27, 2022

kaysond commented May 27, 2022

louislam commented May 28, 2022 • edited Loading

kaysond commented May 29, 2022

louislam commented May 29, 2022

kaysond commented Mar 28, 2022 •

edited

Loading

kaysond commented Apr 4, 2022 •

edited

Loading

quthla commented May 27, 2022 •

edited

Loading

louislam commented May 28, 2022 •

edited

Loading