firstSuccess in health checks is incorrect #1082
Looks like there has been a failover in the meantime. Health check results are currently not being persisted. |
Looks like the first node was the leader for the whole time. |
That's pretty strange. The relevant code is here: and here. So the only way for the date of the first health check to change is when the health check actor is restarted, which should only happen on failover. Could you please search the logs for leader election entries? |
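For illustration, a minimal Akka sketch (hypothetical names, not Marathon's actual code) of why an actor restart would reset the date: if firstSuccess lives only in actor state, a restart, e.g. after failover, silently wipes it, and the next success looks like the first one again.

import akka.actor.Actor
import org.joda.time.DateTime

case object Healthy

// Hypothetical sketch: firstSuccess is kept only in actor memory, so an
// actor restart (which wipes mutable state) resets it to None.
class HealthCheckActor extends Actor {
  private var firstSuccess: Option[DateTime] = None

  def receive: Receive = {
    case Healthy =>
      if (firstSuccess.isEmpty) firstSuccess = Some(DateTime.now())
  }
}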
Is this on 0.7.6 or 0.8.0-RC1? |
This is master, f7f9a00 as HEAD. |
Okay, I have been able to reproduce this. |
@bobrik will those turn green eventually? Or do they stay like that forever? |
It had all turned green when I checked an hour later. |
In an hour? That's long :) With the usual health check intervals? |
It might've happened faster; I only checked after an hour. I can probably check the logs for exact times tomorrow. Health checks for this app run every 2 seconds, and deployment is very fast too: a 20 MB Docker container. |
Don't worry. Will try to reproduce this. |
@bobrik Which kind of health check was it? HTTP? Command? Can you reproduce it with the Mesos containerizer? |
Tried HTTP and COMMAND, cannot reproduce. |
Reproduced it with an initial deployment. The health checks are TCP. No health checks performed:
{
"app": {
"id": "/topface/test/wtf",
"cmd": "/app -listen 0.0.0.0:$PORT",
"args": null,
"user": null,
"env": {},
"instances": 2,
"cpus": 0.2,
"mem": 64,
"disk": 0,
"executor": "",
"constraints": [
[
"hostname",
"UNIQUE"
]
],
"uris": [],
"storeUrls": [],
"ports": [
18001
],
"requirePorts": false,
"backoffSeconds": 5,
"backoffFactor": 1,
"maxLaunchDelaySeconds": 3600,
"container": {
"type": "DOCKER",
"volumes": [
{
"containerPath": "/etc/ssl/certs",
"hostPath": "/etc/ssl/certs",
"mode": "RO"
}
],
"docker": {
"image": "docker.core.tf/scruffy:2015-01-18.1",
"privileged": false,
"parameters": []
}
},
"healthChecks": [
{
"path": "/",
"protocol": "TCP",
"portIndex": 0,
"gracePeriodSeconds": 15,
"intervalSeconds": 2,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 3
}
],
"dependencies": [],
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"labels": {},
"version": "2015-01-27T07:13:40.196Z",
"tasksStaged": 0,
"tasksRunning": 2,
"tasksHealthy": 0,
"tasksUnhealthy": 0,
"deployments": [],
"tasks": [
{
"id": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799",
"host": "web323",
"ports": [
31650
],
"startedAt": "2015-01-27T07:13:45.488Z",
"stagedAt": "2015-01-27T07:13:41.684Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf"
},
{
"id": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799",
"host": "web491",
"ports": [
31778
],
"startedAt": "2015-01-27T07:13:45.516Z",
"stagedAt": "2015-01-27T07:13:41.618Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf"
}
],
"lastTaskFailure": null
}
}
Here they are:
{
"app": {
"id": "/topface/test/wtf",
"cmd": "/app -listen 0.0.0.0:$PORT",
"args": null,
"user": null,
"env": {},
"instances": 2,
"cpus": 0.2,
"mem": 64,
"disk": 0,
"executor": "",
"constraints": [
[
"hostname",
"UNIQUE"
]
],
"uris": [],
"storeUrls": [],
"ports": [
18001
],
"requirePorts": false,
"backoffSeconds": 5,
"backoffFactor": 1,
"maxLaunchDelaySeconds": 3600,
"container": {
"type": "DOCKER",
"volumes": [
{
"containerPath": "/etc/ssl/certs",
"hostPath": "/etc/ssl/certs",
"mode": "RO"
}
],
"docker": {
"image": "docker.core.tf/scruffy:2015-01-18.1",
"privileged": false,
"parameters": []
}
},
"healthChecks": [
{
"path": "/",
"protocol": "TCP",
"portIndex": 0,
"gracePeriodSeconds": 15,
"intervalSeconds": 2,
"timeoutSeconds": 5,
"maxConsecutiveFailures": 3
}
],
"dependencies": [],
"upgradeStrategy": {
"minimumHealthCapacity": 1,
"maximumOverCapacity": 1
},
"labels": {},
"version": "2015-01-27T07:13:40.196Z",
"tasksStaged": 0,
"tasksRunning": 2,
"tasksHealthy": 2,
"tasksUnhealthy": 0,
"deployments": [],
"tasks": [
{
"id": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799",
"host": "web323",
"ports": [
31650
],
"startedAt": "2015-01-27T07:13:45.488Z",
"stagedAt": "2015-01-27T07:13:41.684Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf",
"healthCheckResults": [
{
"alive": true,
"consecutiveFailures": 0,
"firstSuccess": "2015-01-27T07:17:53.424Z",
"lastFailure": null,
"lastSuccess": "2015-01-27T07:18:09.585Z",
"taskId": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799"
}
]
},
{
"id": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799",
"host": "web491",
"ports": [
31778
],
"startedAt": "2015-01-27T07:13:45.516Z",
"stagedAt": "2015-01-27T07:13:41.618Z",
"version": "2015-01-27T07:13:40.196Z",
"appId": "/topface/test/wtf",
"healthCheckResults": [
{
"alive": true,
"consecutiveFailures": 0,
"firstSuccess": "2015-01-27T07:17:53.424Z",
"lastFailure": null,
"lastSuccess": "2015-01-27T07:18:09.585Z",
"taskId": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799"
}
]
}
],
"lastTaskFailure": null
}
}
Deployment finished a while ago, but the health check results didn't appear. Health checks in the Marathon log started much sooner.
I'm using master with your PR for the health bars in the UI merged. |
I'm seeing similar issues after scaling up: immediately after starting a new task, the JSON returned by the API does not include a healthCheckResults field. |
@c089 What does your app JSON look like? |
The complete one reported by the API, or the one I used to create the app? |
The one to create the app. |
I'm mostly using defaults; I stripped it down to the minimum I needed to deploy, and I'm also using Docker containers. Generated from this JS object:
|
Weird, this happens only in the production cluster. I tried the same Marathon image with a different data path in ZK and health checks appear immediately. Here are logs from the production master (grepped for wtf): https://gist.github.com/bobrik/dd2fdbf49284dd189e19 Interesting stuff starts at |
The interesting line is
Why does Marathon start another health check actor? It was started at the top already. |
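One way the duplicate start could happen (a hedged sketch with invented names, not Marathon's actual code): if running actors are tracked in a map keyed by app version, and the key's hashCode is inconsistent with its equals, then contains misses an entry that is semantically present, so a second actor is started and its in-memory firstSuccess begins from scratch.

import scala.collection.mutable

// Hypothetical sketch: one health check actor per (appId, version). When
// `version` hashes inconsistently with its equals, `contains` misses a key
// that is already present, and a duplicate actor is started.
class HealthCheckManager[V] {
  private val active = mutable.HashMap.empty[(String, V), String]

  def ensureActor(appId: String, version: V): Unit =
    if (!active.contains((appId, version))) { // misses if V's hashCode is broken
      active((appId, version)) = s"actor-$appId-${active.size}"
      println(s"Starting health check actor for app [$appId] version [$version]")
    }
}

Calling ensureActor twice with two versions that are equal but hash differently would start two actors, matching the duplicated "start" line in the log.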
I've seen this without any failover, even on a 1-node cluster. |
And I'm still seeing this on a 3-node production cluster. I just updated to current master and the issue is still there. I tried deploying an app with different paths: I tried a different app on
Looks like the state of my cluster got broken somehow. I think it's better to figure out what went wrong instead of just nuking the current state and hoping it won't happen again. I can add more logging to provide you with more info. Getting a ZK dump is also possible. |
Well, at least the deployment waits for new tasks to become healthy. This issue should stay open anyway. |
Interesting observation: when
More context:
@drexin, @ConnorDoyle please reopen the issue. |
What's the marathon log at the same moment? |
https://gist.github.com/bobrik/dcca1f2c375f7d1399ea I removed HTTP logs and irrelevant apps. |
Look at these lines:
This means that both
In order to get a better understanding, could you try to reproduce the problem with this patch? https://gist.github.com/anonymous/f67894498e5df2cababd |
Sure, here is the log: https://gist.github.com/bobrik/fbf5b36309e5899530df Lines are filtered like before. |
Look at the
We have visually the same key. If we assume that the Scala map is not broken, the equality check on AppVersion seems to yield something we don't expect. |
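To make "the map isn't broken, the key is" concrete, here is a self-contained sketch of the failure mode: a key whose equals ignores a field that its generated hashCode still mixes in. Two keys print identically and compare equal, yet the HashMap lookup misses. BrokenKey is an invented stand-in, not Marathon's actual key type.

import scala.collection.immutable.HashMap

// Invented stand-in: equals ignores `zone`, but the compiler-generated
// hashCode still mixes it in, so equal keys can land in different buckets.
case class BrokenKey(millis: Long, zone: String) {
  override def equals(obj: Any): Boolean = obj match {
    case that: BrokenKey => millis == that.millis // zone deliberately ignored
    case _               => false
  }
  override def toString: String = millis.toString // both keys print the same
}

object MapMissDemo extends App {
  val stored = BrokenKey(1422342820196L, "+04:00")
  val lookup = BrokenKey(1422342820196L, "UTC")
  val m = HashMap(stored -> "health check results")
  println(stored == lookup) // true: visually and semantically the same key
  println(m.get(lookup))    // None: different hashCode, wrong bucket
}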
Huh, I deployed several times and health checks accumulated on each deploy:
|
Timestamp's hashCode was based on the dateTime parameter, which might have a timezone != UTC. The consequence: val t = Timestamp.now; t.hashCode != Timestamp(t.toString).hashCode. This patch overrides hashCode to use the UTC-normalized time variable of the Timestamp case class. This probably fixes d2iq-archive#1082
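A sketch of the bug and the fix described in the commit message, assuming a Timestamp wrapper roughly like Marathon's: equals compares the instant regardless of time zone, but the old hashCode came from the wrapped DateTime, whose Joda hashCode includes the chronology (and therefore the zone). Hashing the UTC-normalized time restores the equals/hashCode contract.

import org.joda.time.{DateTime, DateTimeZone}

// Sketch, not Marathon's exact code: a Timestamp whose equality is
// instant-based. Joda's DateTime.hashCode mixes in the chronology, so the
// generated hashCode differed across zones; hashing the UTC-normalized
// time fixes it.
case class Timestamp(dateTime: DateTime) {
  private val utcTime: DateTime = dateTime.withZone(DateTimeZone.UTC)

  override def equals(obj: Any): Boolean = obj match {
    case that: Timestamp => dateTime.isEqual(that.dateTime) // same instant
    case _               => false
  }

  // The fix: hash the UTC-normalized time so equal instants hash equally.
  override def hashCode: Int = utcTime.hashCode
}

object TimestampDemo extends App {
  val local  = Timestamp(new DateTime(DateTimeZone.forOffsetHours(4)))
  val parsed = Timestamp(local.dateTime.withZone(DateTimeZone.UTC)) // "re-parsed" as UTC
  println(local == parsed)                   // true: same instant
  println(local.hashCode == parsed.hashCode) // true with the fix, false before
}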
@bobrik Please test the patch above. I think this fixes your issue. |
I'm running Marathon in UTC, but I'll give it a try. |
I think it's about the chronology parameter in general inside DateTime. There might be other chronologies which are actually equivalent to UTC but still give a different hashCode. Let's see; it's only a theory for now. In my testing, I did this:
val t = Timestamp.now
t.hashCode
Timestamp(t.toString).hashCode
The two hash codes differed. |
OMG IT WORKED! Beers are on me next time I meet you in person 🍺 |
:-) |
Nice find @sttts |
Wondering how many other strange behaviors were triggered by this bug... a map which isn't a map. |
I wonder why it worked in the first place; with 1-2 apps it works pretty well. |
Is it because |
Yes, could be. I guess Scala's Map implementation uses |
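For context (a fact about the Scala standard library, not stated in the thread): scala.collection.immutable.Map uses specialized Map1 through Map4 classes for up to four entries, and those look keys up with == without ever calling hashCode; only larger maps switch to a hash-based HashMap. That would explain why things worked with 1-2 apps and broke once enough entries accumulated. A self-contained sketch:

// Demonstrates the size threshold. K has the same pathology as the bug:
// equals ignores `zone`, the generated hashCode does not.
object SmallMapDemo extends App {
  case class K(millis: Long, zone: String) {
    override def equals(o: Any): Boolean = o match {
      case that: K => millis == that.millis
      case _       => false
    }
  }

  val small = Map(K(1, "+04:00") -> "a", K(2, "+04:00") -> "b")   // Map2: linear == lookup
  val big   = Map((1L to 5L).map(i => K(i, "+04:00") -> "x"): _*) // 5 entries: HashMap

  println(small.get(K(1, "UTC"))) // Some(a): small maps never call hashCode
  println(big.get(K(1, "UTC")))   // None: hash-based lookup misses
}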
I'm running master, as usual.
First success for a task is reported in the UI at 13:47:32 (UTC+4, see #1038). But in the logs the first success happened much sooner:
The last started task in this app was actually started at ~13:47. It looks like health check results are not saved until all tasks are running. In the UI, the health check bullet is gray until every task in the current revision is running.