firstSuccess in health checks is incorrect #1082
Looks like there has been a failover in the meantime. Health check results are currently not being persisted. |
Looks like the first node was the leader for the whole time. |
That's pretty strange. The relevant code is here: and here. So the only way for the date of the first health check to change is when the health check actor is restarted, which should only happen on failover. Could you please search the logs for leader election entries? |
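For illustration, a minimal Akka sketch (hypothetical names, not Marathon's actual code) of why an actor restart would reset the date: if firstSuccess lives only in actor state, a restart, e.g. after failover, silently wipes it, and the next success looks like the first one again.

import akka.actor.Actor
import org.joda.time.DateTime

case object Healthy

// Hypothetical sketch: firstSuccess is kept only in actor memory, so an
// actor restart (which wipes mutable state) resets it to None.
class HealthCheckActor extends Actor {
  private var firstSuccess: Option[DateTime] = None

  def receive: Receive = {
    case Healthy =>
      if (firstSuccess.isEmpty) firstSuccess = Some(DateTime.now())
  }
}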
Is this on 0.7.6 or 0.8.0-RC1? |
This is master, f7f9a00 as HEAD. |
Okay, I have been able to reproduce this. |
@bobrik will those turn green eventually? Or do they stay like that forever? |
It had all turned green when I checked an hour later. |
In an hour? That's long :) With the usual health check intervals? |
It might've happened faster; I only checked after an hour. I can probably check the logs for exact times tomorrow. Health checks for this app run every 2 seconds, and deployment is very fast too: a 20 MB Docker container. |
Don't worry. Will try to reproduce this. |
@bobrik Which kind of health check was it? HTTP? Command? Can you reproduce it with the Mesos containerizer? |
Tried HTTP and COMMAND, cannot reproduce. |
Reproduced it with an initial deployment. The health checks are TCP. No health checks performed:
{
"app": {
"id": "/topface/test/wtf",
"cmd": "/app -listen 0.0.0.0:$PORT",
"args": null,
"user": null,
"env": {},
"instances": 2,
"cpus": 0.2,
"mem": 64,
"disk": 0,
"executor": "",
"constraints": [
[
"hostname",
"UNIQUE"
]
],
"uris": [],
"storeUrls": [],
"ports": [
18001
],
"requirePorts": false,
"backoffSeconds": 5,
"backoffFactor": 1,
"maxLaunchDelaySeconds": 3600,
"container": {
"type": "DOCKER",
"volumes": [
{
"containerPath": "/etc/ssl/certs",
"hostPath": "/etc/ssl/certs",
"mode": "RO"
}
],
"docker": {
"image": "docker.core.tf/scruffy:2015-01-18.1",
"privileged": false,
"parameters": []
}
},
"healthChecks": [
{
"path": "/",
"protocol": "TCP",
"portIndex": 0,
"gracePeriodSeconds": 15,
"intervalSeconds": 2,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 3
}
],
"dependencies": [],
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"labels": {},
"version": "2015-01-27T07:13:40.196Z",
"tasksStaged": 0,
"tasksRunning": 2,
"tasksHealthy": 0,
"tasksUnhealthy": 0,
"deployments": [],
"tasks": [
{
"id": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799",
"host": "web323",
"ports": [
31650
],
"startedAt": "2015-01-27T07:13:45.488Z",
"stagedAt": "2015-01-27T07:13:41.684Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf"
},
{
"id": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799",
"host": "web491",
"ports": [
31778
],
"startedAt": "2015-01-27T07:13:45.516Z",
"stagedAt": "2015-01-27T07:13:41.618Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf"
}
],
"lastTaskFailure": null
}
}
Here they are:
{
"app": {
"id": "/topface/test/wtf",
"cmd": "/app -listen 0.0.0.0:$PORT",
"args": null,
"user": null,
"env": {},
"instances": 2,
"cpus": 0.2,
"mem": 64,
"disk": 0,
"executor": "",
"constraints": [
[
"hostname",
"UNIQUE"
]
],
"uris": [],
"storeUrls": [],
"ports": [
18001
],
"requirePorts": false,
"backoffSeconds": 5,
"backoffFactor": 1,
"maxLaunchDelaySeconds": 3600,
"container": {
"type": "DOCKER",
"volumes": [
{
"containerPath": "/etc/ssl/certs",
"hostPath": "/etc/ssl/certs",
"mode": "RO"
}
],
"docker": {
"image": "docker.core.tf/scruffy:2015-01-18.1",
"privileged": false,
"parameters": []
}
},
"healthChecks": [
{
"path": "/",
"protocol": "TCP",
"portIndex": 0,
"gracePeriodSeconds": 15,
"intervalSeconds": 2,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 3
}
],
"dependencies": [],
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"labels": {},
"version": "2015-01-27T07:13:40.196Z",
"tasksStaged": 0,
"tasksRunning": 2,
"tasksHealthy": 2,
"tasksUnhealthy": 0,
"deployments": [],
"tasks": [
{
"id": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799",
"host": "web323",
"ports": [
31650
],
"startedAt": "2015-01-27T07:13:45.488Z",
"stagedAt": "2015-01-27T07:13:41.684Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf",
"healthCheckResults": [
{
"alive": true,
"consecutiveFailures": 0,
"firstSuccess": "2015-01-27T07:17:53.424Z",
"lastFailure": null,
"lastSuccess": "2015-01-27T07:18:09.585Z",
"taskId": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799"
}
]
},
{
"id": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799",
"host": "web491",
"ports": [
31778
],
"startedAt": "2015-01-27T07:13:45.516Z",
"stagedAt": "2015-01-27T07:13:41.618Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf",
"healthCheckResults": [
{
"alive": true,
"consecutiveFailures": 0,
"firstSuccess": "2015-01-27T07:17:53.424Z",
"lastFailure": null,
"lastSuccess": "2015-01-27T07:18:09.585Z",
"taskId": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799"
}
]
}
],
"lastTaskFailure": null
}
}
Deployment finished a while ago, but the health check results didn't appear. Health checks in the Marathon log started much sooner.
I'm using master with your PR for the health bars in the UI merged. |
I'm seeing similar issues after scaling up: immediately after starting a new task, the JSON returned by the API does not include a healthCheckResults field. |
@c089 What does your app JSON look like? |
The complete one reported by the API, or the one I used to create the app? |
The one to create the app. |
I'm mostly using defaults; I stripped it down to the minimum I needed to deploy, and I'm also using Docker containers. Generated from this JS object:
|
Weird, this happens only in the production cluster. I tried the same Marathon image with a different data path in ZK and health checks appear immediately. Here are logs from the production master (grepped for wtf): https://gist.github.com/bobrik/dd2fdbf49284dd189e19 Interesting stuff starts at |
The interesting line is
Why does Marathon start another health check actor? It was started at the top already. |
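One way the duplicate start could happen (a hedged sketch with invented names, not Marathon's actual code): if running actors are tracked in a map keyed by app version, and the key's hashCode is inconsistent with its equals, then contains misses an entry that is semantically present, so a second actor is started and its in-memory firstSuccess begins from scratch.

import scala.collection.mutable

// Hypothetical sketch: one health check actor per (appId, version). When
// `version` hashes inconsistently with its equals, `contains` misses a key
// that is already present, and a duplicate actor is started.
class HealthCheckManager[V] {
  private val active = mutable.HashMap.empty[(String, V), String]

  def ensureActor(appId: String, version: V): Unit =
    if (!active.contains((appId, version))) { // misses if V's hashCode is broken
      active((appId, version)) = s"actor-$appId-${active.size}"
      println(s"Starting health check actor for app [$appId] version [$version]")
    }
}

Calling ensureActor twice with two versions that are equal but hash differently would start two actors, matching the duplicated "start" line in the log.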
I've seen this without any failover, even on a 1-node cluster. |
And I'm still seeing this on a 3-node production cluster. I just updated to current master and the issue is still there. I tried deploying an app with different paths: I tried a different app on
Looks like the state of my cluster got broken somehow. I think it's better to figure out what went wrong instead of just nuking the current state and hoping it won't happen again. I can add more logging to provide you with more info. Getting a ZK dump is also possible. |
Well, at least the deployment waits for new tasks to become healthy. This issue should stay open anyway. |
Interesting observation: when
More context:
@drexin, @ConnorDoyle please reopen the issue. |
What's the marathon log at the same moment? |
https://gist.github.com/bobrik/dcca1f2c375f7d1399ea I removed HTTP logs and irrelevant apps. |
Look at these lines:
This means that both
In order to get a better understanding, could you try to reproduce the problem with this patch? https://gist.github.com/anonymous/f67894498e5df2cababd |
Sure, here is the log: https://gist.github.com/bobrik/fbf5b36309e5899530df Lines are filtered like before. |
Look at the
We have visually the same key. If we assume that the Scala map is not broken, the equality check on AppVersion seems to yield something we don't expect. |
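To make "the map isn't broken, the key is" concrete, here is a self-contained sketch of the failure mode: a key whose equals ignores a field that its generated hashCode still mixes in. Two keys print identically and compare equal, yet the HashMap lookup misses. BrokenKey is an invented stand-in, not Marathon's actual key type.

import scala.collection.immutable.HashMap

// Invented stand-in: equals ignores `zone`, but the compiler-generated
// hashCode still mixes it in, so equal keys can land in different buckets.
case class BrokenKey(millis: Long, zone: String) {
  override def equals(obj: Any): Boolean = obj match {
    case that: BrokenKey => millis == that.millis // zone deliberately ignored
    case _               => false
  }
  override def toString: String = millis.toString // both keys print the same
}

object MapMissDemo extends App {
  val stored = BrokenKey(1422342820196L, "+04:00")
  val lookup = BrokenKey(1422342820196L, "UTC")
  val m = HashMap(stored -> "health check results")
  println(stored == lookup) // true: visually and semantically the same key
  println(m.get(lookup))    // None: different hashCode, wrong bucket
}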
Huh, I deployed several times and health checks accumulated on each deploy:
|
Timestamp's hashCode was based on the dateTime parameter, which might have a timezone != UTC. The consequence: val t = Timestamp.now; t.hashCode != Timestamp(t.toString).hashCode. This patch overrides hashCode to use the UTC-normalized time variable of the Timestamp case class. This probably fixes d2iq-archive#1082
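A sketch of the bug and the fix described in the commit message, assuming a Timestamp wrapper roughly like Marathon's: equals compares the instant regardless of time zone, but the old hashCode came from the wrapped DateTime, whose Joda hashCode includes the chronology (and therefore the zone). Hashing the UTC-normalized time restores the equals/hashCode contract.

import org.joda.time.{DateTime, DateTimeZone}

// Sketch, not Marathon's exact code: a Timestamp whose equality is
// instant-based. Joda's DateTime.hashCode mixes in the chronology, so the
// generated hashCode differed across zones; hashing the UTC-normalized
// time fixes it.
case class Timestamp(dateTime: DateTime) {
  private val utcTime: DateTime = dateTime.withZone(DateTimeZone.UTC)

  override def equals(obj: Any): Boolean = obj match {
    case that: Timestamp => dateTime.isEqual(that.dateTime) // same instant
    case _               => false
  }

  // The fix: hash the UTC-normalized time so equal instants hash equally.
  override def hashCode: Int = utcTime.hashCode
}

object TimestampDemo extends App {
  val local  = Timestamp(new DateTime(DateTimeZone.forOffsetHours(4)))
  val parsed = Timestamp(local.dateTime.withZone(DateTimeZone.UTC)) // "re-parsed" as UTC
  println(local == parsed)                   // true: same instant
  println(local.hashCode == parsed.hashCode) // true with the fix, false before
}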
@bobrik Please test the patch above. I think this fixes your issue. |
I'm running Marathon in UTC, but I'll give it a try. |
I think it's about the chronology parameter in general inside DateTime. There might be other chronologies which are actually equivalent to UTC but still give a different hashCode. Let's see; it's only a theory for now. In my testing, I did this:
val t = Timestamp.now
t.hashCode
Timestamp(t.toString).hashCode
The two hash codes differed. |
OMG IT WORKED! Beers are on me next time I meet you in person 🍺 |
:-) |
Nice find @sttts |
Wondering how many other strange behaviors were triggered by this bug... a map which isn't a map. |
I wonder why it worked in the first place; with 1-2 apps it works pretty well. |
Is it because |
Yes, could be. I guess Scala's Map implementation uses |
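For context (a fact about the Scala standard library, not stated in the thread): scala.collection.immutable.Map uses specialized Map1 through Map4 classes for up to four entries, and those look keys up with == without ever calling hashCode; only larger maps switch to a hash-based HashMap. That would explain why things worked with 1-2 apps and broke once enough entries accumulated. A self-contained sketch:

// Demonstrates the size threshold. K has the same pathology as the bug:
// equals ignores `zone`, the generated hashCode does not.
object SmallMapDemo extends App {
  case class K(millis: Long, zone: String) {
    override def equals(o: Any): Boolean = o match {
      case that: K => millis == that.millis
      case _       => false
    }
  }

  val small = Map(K(1, "+04:00") -> "a", K(2, "+04:00") -> "b")   // Map2: linear == lookup
  val big   = Map((1L to 5L).map(i => K(i, "+04:00") -> "x"): _*) // 5 entries: HashMap

  println(small.get(K(1, "UTC"))) // Some(a): small maps never call hashCode
  println(big.get(K(1, "UTC")))   // None: hash-based lookup misses
}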
I'm running master, as usual.
First success for a task is reported in the UI at 13:47:32 (UTC+4, see #1038). But in the logs the first success happened much sooner:
The last started task in this app was actually started at ~13:47. It looks like health check results are not saved until all tasks are running. In the UI, the health check bullet is gray until every task in the current revision is running.