Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

Merged
merged 7 commits into from
May 29, 2022

Conversation

kaysond
Copy link
Contributor

@kaysond kaysond commented Mar 28, 2022

Description

Fixes #1422.

The goal is to avoid the race condition caused by creating heartbeats and checking them in two different, asynchronous functions, which is exacerbated by the possible drift of heartbeat checking due to setTimeout behavior.

The basic idea is that before we do a setTimeout(beat), we check when the last heartbeat arrived, and adjust the timeout such that the next check happens 1 heartbeat interval after the last heartbeat. This prevents the heartbeat check from drifting relative to the api call (which usually happens on a fixed interval like a cron job).

A 1s buffer time is added to that timeout to ensure that the heartbeat check happens after the api call is expected. This helps avoid the race condition that occurs when the heartbeat check and api call happen simultaneously (which they will tend to do). The buffer time may need to be increased, but the tradeoff is a delay in when the DOWN notification is sent in the event of a missed heartbeat.

It may be desirable to make the buffer time user-controlled.

Right now, retries aren't explicitly addressed. I would propose that in the case of missed heartbeats, the monitor reverts to checking exactly every heartbeat interval (i.e. ignoring the time of the last heartbeat).

This seemed to me to be the best way to address the issue without making huge architectural changes (i.e. make the heartbeat checking happen in the router function).

Please let me know if this all looks ok and I can finish it up, and finish testing, linting/formatting, etc.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • My code follows the style guidelines of this project
  • I ran ESLint and other linters for modified files
  • I have performed a self-review of my own code and tested it
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • My code needed automated testing. I have added them (this is optional task)

server/model/monitor.js Outdated Show resolved Hide resolved
server/model/monitor.js Outdated Show resolved Hide resolved
@Saibamen
Copy link
Contributor

Please use draft PR
https://github.blog/2019-02-14-introducing-draft-pull-requests/

@kaysond
Copy link
Contributor Author

kaysond commented Apr 4, 2022

Ok so this is updated and tested. There's still a remaining issue caused by the database not storing the datetimes with milliseconds. This is an issue with the redbean-node library (see louislam/redbean-node#6).

Once that's fixed, I'll retest this PR

@kaysond
Copy link
Contributor Author

kaysond commented Apr 4, 2022

Without milliseconds in the format, the time information in the database can't be used to determine intervals <1 second. For example, if you have one event at 12:15:24.999 and another at 12:15:45.001, the database will record them as 12:15:24 and 12:15:45, showing a 21s time difference even though its much closer to 20s (20.002).

This means that even if the api call happens within one heartbeat interval, but it the seconds in the timestamp is different, you'll get an incorrect down notification.

server/model/monitor.js Outdated Show resolved Hide resolved
@kaysond
Copy link
Contributor Author

kaysond commented Apr 8, 2022

Ok looks like with louislam/redbean-node#7 implemented, the last remaining problem is gone.

I did run into a weird issue when I tried to symlink my local redbean-node repo into node_modules where timezones weren't working properly, but when I manually copied the updated redbean-node/dist folder into uptime-kuma/node_modules/redbean-node, the issue went away.

Once the upstream is released, I'll make sure that things still work when it pulls the package from npm.

@kaysond kaysond mentioned this pull request Apr 12, 2022
2 tasks
@kaysond kaysond marked this pull request as ready for review April 12, 2022 22:25
@kaysond
Copy link
Contributor Author

kaysond commented Apr 12, 2022

@louislam this is ready to go!

FYI - I did notice that saving a push monitor causes it to resume, which will also cause a race condition with the api call. That's not addressed by this PR, but it only happens on user interaction, and even then its extremely rare. I don't think its really worth addressing.

Also, in case anyone was wondering, the issue I was seeing with timezones/dependencies was caused by a mismatch in the dayjs version used in uptime-kuma and redbean-node. For some reason the different versions caused a difference in the dayjs prototype between the monitor code and the dependency, so the below instanceof operator was returning false, even though by design it should've been true. This meant the database stored a non-UTC datetime

https://github.com/louislam/redbean-node/blob/3b1faa62a30a8141ac3fc0bbbe348ae51464941c/lib/redbean-node.ts#L997-L1004

@kaysond kaysond force-pushed the master branch 2 times, most recently from f240fcc to 08b1122 Compare April 12, 2022 22:37
@kaysond
Copy link
Contributor Author

kaysond commented Apr 12, 2022

Ok cleaned it up and ran the unit tests

@chakflying
Copy link
Collaborator

Logic seems sound and tested the 1 second buffer time works okay. Just need to run eslint fixes.

@kaysond
Copy link
Contributor Author

kaysond commented Apr 13, 2022

@chakflying fixed linting issues in monitor.js
The rest of the eslint errors are outside this PR

@kaysond kaysond force-pushed the master branch 4 times, most recently from 030d19e to 780449a Compare April 18, 2022 20:32
@kaysond kaysond changed the title [Draft] Synchronize push monitor heartbeats to api calls (fixes #1422) Synchronize push monitor heartbeats to api calls (fixes #1422) Apr 18, 2022
@kaysond
Copy link
Contributor Author

kaysond commented Apr 19, 2022

rebased. @louislam this is ready to be merged

@kaysond
Copy link
Contributor Author

kaysond commented May 26, 2022

The rest of the original PR is not needed.

You need the rest of it because otherwise heartbeats aren't stored in the database with ms precision, which will cause false DOWN's due to rounding errors.

I don't think your suggestion will actually work. At first glance you're going to run into some edge cases that will break things, but tbh I don't have time to test it. And while I appreciate that you've simplified the math, I wrote it the way I did specifically to be easy to read and understand, not necessarily to be simple or have fewer lines. IMO the PR still reads more easily in terms of understanding.

@quthla
Copy link

quthla commented May 26, 2022

The rest of the original PR is not needed.

You need the rest of it because otherwise heartbeats aren't stored in the database with ms precision, which will cause false DOWN's due to rounding errors.

Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.

I don't think your suggestion will actually work. At first glance you're going to run into some edge cases that will break things, but tbh I don't have time to test it. And while I appreciate that you've simplified the math, I wrote it the way I did specifically to be easy to read and understand, not necessarily to be simple or have fewer lines. IMO the PR still reads more easily in terms of understanding.

I added a grace period of 500ms diff + 500 for safety reasons. I ran it initially without the 500ms and it still worked in all cases.

In your PR the calls to log.debug also lack the first parameter and will not work.

@kaysond
Copy link
Contributor Author

kaysond commented May 27, 2022

Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.

That's not correct. See this old comment for the basic explanation. It does matter to eliminate the race condition because of how the JS event loop and setTimeout works, as I've mentioned in other comments.

I added a grace period of 500ms diff + 500 for safety reasons. I ran it initially without the 500ms and it still worked in all cases.

I'm skeptical. You probably just didn't run it long enough for the race condition to happen. The value of the grace period is somewhat arbitrary. With the on-LAN cron job I was using, I didn't see any latencies >250ms, but I figured 1000ms gives a large safety margin for the worst conceivable situations (e.g. very resource-constrained service over WAN) with an insignificant hit to the notification delay.

In your PR the calls to log.debug also lack the first parameter and will not work.

Fixed, thanks. The logging call was totally different when I opened the PR 2 months ago. This is what happens when you have to keep rebasing...

@quthla
Copy link

quthla commented May 27, 2022

Higher resolution doesn't hurt to have but is not needed because the heartbeat interval is in seconds anyway.

That's not correct. See this old comment for the basic explanation. It does matter to eliminate the race condition because of how the JS event loop and setTimeout works, as I've mentioned in other comments.

A grace period of 1000ms should cover for that

I added a grace period of 500ms diff + 500 for safety reasons. I ran it initially without the 500ms and it still worked in all cases.

I'm skeptical. You probably just didn't run it long enough for the race condition to happen. The value of the grace period is somewhat arbitrary. With the on-LAN cron job I was using, I didn't see any latencies >250ms, but I figured 1000ms gives a large safety margin for the worst conceivable situations (e.g. very resource-constrained service over WAN) with an insignificant hit to the notification delay.

Ideally you would set the heartbeat interval to 21s instead of 20s to cover for cases like that.

In your PR the calls to log.debug also lack the first parameter and will not work.

Fixed, thanks. The logging call was totally different when I opened the PR 2 months ago. This is what happens when you have to keep rebasing...

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

@louislam louislam added this to the 1.17.0 milestone May 27, 2022
@louislam
Copy link
Owner

@quthla The new logging function was merged recently, it is acceptable.

I will try to fully focus on this pr this weekend.

@kaysond
Copy link
Contributor Author

kaysond commented May 27, 2022

A grace period of 1000ms should cover for that

It doesn't quite. If you don't have ms precision, you can get a ping at 10:10:10.999, but the db stores it at 10:10:10.000 (it doesn't round), so the monitor checks for the next ping at 10:10:31.000 (assuming 20s interval + 1s buffer). The next ping will also come at 10:10:30.999, so you end up back at the race condition. Obviously you can fix this with a >1s buffer, but that's more like addressing the symptom instead of the cause. You also need ms precision for sub-1s buffers to be meaningful, and I think long term it makes sense to have the buffer be user-defined with a lower default.

Ideally you would set the heartbeat interval to 21s instead of 20s to cover for cases like that.

You could already do that without this PR. I've been running that as a workaround since I reported the original issue. I think a common scenario is a 60s interval with cronjob ping, so it seems wise to support that without requiring users to fudge the interval to handle normal or even large latency (not to mention the JS event loop, which no one should have to worry about)

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

Yes. You can use the git blame to see which commit changed it. Looks like 4/13.

@quthla
Copy link

quthla commented May 27, 2022

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

Yes. You can use the git blame to see which commit changed it. Looks like 4/13.

And yet that PR only changed the call from log_info to log.info but the logging function always had 2 parameters since being introduced back in 2021. Anyways, glad you fixed it and looking forward to your PR being merged.

bdcdf47

@louislam
Copy link
Owner

@quthla Usually you have to check the pr merged date instead of commit date.

Let say we made a commit in 2015, and we can create a pull request and get merged in 2022.

@quthla
Copy link

quthla commented May 27, 2022

@louislam you're right. I was confused because it usually doesn't take months for rather simple PRs to be merged and I didn't check if the commit was part of a PR.

@kaysond
Copy link
Contributor Author

kaysond commented May 27, 2022

Are you sure? It seems logging always used these 2 parameters ever since it was introduced back in 2021.

Yes. You can use the git blame to see which commit changed it. Looks like 4/13.

And yet that PR only changed the call from log_info to log.info but the logging function always had 2 parameters since being introduced back in 2021. Anyways, glad you fixed it and looking forward to your PR being merged.

bdcdf47

@quthla for the record - this is what the file looked like when I wrote and submitted the PR

} else if (this.type === "push") { // Type: Push
const time = R.isoDateTime(dayjs.utc().subtract(this.interval, "second"));
let heartbeatCount = await R.count("heartbeat", " monitor_id = ? AND time > ? ", [
this.id,
time
]);
debug("heartbeatCount" + heartbeatCount + " " + time);
if (heartbeatCount <= 0) {
// Fix #922, since previous heartbeat could be inserted by api, it should get from database
previousBeat = await Monitor.getPreviousHeartbeat(this.id);
throw new Error("No heartbeat in the time window");
} else {
// No need to insert successful heartbeat for push type, so end here
retries = 0;
this.heartbeatInterval = setTimeout(beat, beatInterval * 1000);
return;
}

@louislam louislam modified the milestones: 1.17.0, 1.16.1 May 28, 2022
@louislam
Copy link
Owner

louislam commented May 28, 2022

I understand this pr now. Just found a small issue.

If I don't call the push url for the first tick, the error msg is undefined.

image

@kaysond
Copy link
Contributor Author

kaysond commented May 29, 2022

I understand this pr now. Just found a small issue.

If I don't call the push url for the first tick, the error msg is undefined.

image

Fixed!

@louislam
Copy link
Owner

Ready to go, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race condition in push monitors
9 participants