Skip to content
This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

firstSuccess in health checks is incorrect #1082

Closed
bobrik opened this issue Jan 23, 2015 · 72 comments · Fixed by #1218
Closed

firstSuccess in health checks is incorrect #1082

bobrik opened this issue Jan 23, 2015 · 72 comments · Fixed by #1218
Milestone

Comments

@bobrik
Copy link
Contributor

bobrik commented Jan 23, 2015

I'm running master, as usual.

First success for task in ui is reported at 13:47:32 (UTC+4 #1038):

screen shot 2015-01-23 at 12 51 53

But in logs first success happened much faster:

[2015-01-23 09:43:45,927] INFO Received status update for task topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
[INFO] [01/23/2015 09:43:47.680] [marathon-akka.actor.default-dispatcher-19] [akka://marathon/user/$r] Received health result: [Unhealthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,ConnectException: Connection refused,2015-01-23T09:43:47.680Z)]
[INFO] [01/23/2015 09:43:49.700] [marathon-akka.actor.default-dispatcher-18] [akka://marathon/user/$r] Received health result: [Unhealthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,ConnectException: Connection refused,2015-01-23T09:43:49.700Z)]
[INFO] [01/23/2015 09:43:51.720] [marathon-akka.actor.default-dispatcher-5] [akka://marathon/user/$r] Received health result: [Unhealthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,ConnectException: Connection refused,2015-01-23T09:43:51.720Z)]
[INFO] [01/23/2015 09:43:53.742] [marathon-akka.actor.default-dispatcher-23] [akka://marathon/user/$r] Received health result: [Healthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,2015-01-23T09:43:53.741Z)]
[INFO] [01/23/2015 09:43:53.742] [marathon-akka.actor.default-dispatcher-2] [akka://marathon/user/MarathonScheduler/$a/UpgradeManager/f20d2493-45ab-44e3-8ba1-2c5d1437503e/$a] Killing old task topface_prod-test_app.e0b75ff8-a2e2-11e4-b8c3-56847afe9799 because topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799 became reachable
[INFO] [01/23/2015 09:43:55.762] [marathon-akka.actor.default-dispatcher-4] [akka://marathon/user/$r] Received health result: [Healthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,2015-01-23T09:43:55.761Z)]
[INFO] [01/23/2015 09:43:57.780] [marathon-akka.actor.default-dispatcher-21] [akka://marathon/user/$r] Received health result: [Healthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,2015-01-23T09:43:57.780Z)]
[INFO] [01/23/2015 09:43:59.800] [marathon-akka.actor.default-dispatcher-12] [akka://marathon/user/$r] Received health result: [Healthy(topface_prod-test_app.352fb707-a2e4-11e4-b8c3-56847afe9799,2015-01-23T09:42:55.010Z,2015-01-23T09:43:59.800Z)]

The last started task in this app was actually started at ~13:47. It looks like health checks are not saved until all tasks are running. In UI health check bullet is gray before each task is running in current revision.

@drexin
Copy link
Contributor

drexin commented Jan 23, 2015

Looks like there has been a failover in the meantime. Health check results are currently not being persisted.

@bobrik
Copy link
Contributor Author

bobrik commented Jan 23, 2015

Looks like first node was a leader for the whole time.

web489 ~ # docker logs 5a4ef2701bae 2>/dev/null | fgrep '2015-01-23 09:' | fgrep Proxying | sed 's/.*] //g' | uniq -c
   3619 INFO Proxying request to leader at web488:8080 (mesosphere.marathon.api.LeaderProxyFilter:62)

@drexin
Copy link
Contributor

drexin commented Jan 23, 2015

That's pretty strange. The relevant code is here:

https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/health/HealthCheckActor.scala#L135

and here

https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/health/Health.scala

So the only possibility for the date of the first health check to change is when the health check actor is being restarted. That should only happen on failover. Could you please search the logs for leader election entries?

@bobrik
Copy link
Contributor Author

bobrik commented Jan 23, 2015

web488 ~ # docker logs 70b638522bfa 2>/dev/null | fgrep leader | fgrep -v Proxying
[2015-01-23 09:11:55,106] INFO Offering leadership (mesosphere.marathon.MarathonSchedulerService:304)
[2015-01-23 09:11:55,106] INFO Using HA and therefore offering leadership (mesosphere.marathon.MarathonSchedulerService:311)
[2015-01-23 09:11:55,139] INFO Candidate /marathon/leader/member_0000000339 waiting for the next leader election, current voting: [member_0000000339, member_0000000332] (com.twitter.common.zookeeper.CandidateImpl:165)
[2015-01-23 09:12:15,206] INFO Candidate /marathon/leader/member_0000000339 is now leader of group: [member_0000000339] (com.twitter.common.zookeeper.CandidateImpl:152)
[2015-01-23 09:12:15,219] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService:284)
[2015-01-23 10:32:43,741] INFO Offering leadership (mesosphere.marathon.MarathonSchedulerService:304)
[2015-01-23 10:32:43,741] INFO Using HA and therefore offering leadership (mesosphere.marathon.MarathonSchedulerService:311)
[2015-01-23 10:32:43,885] INFO Candidate /marathon/leader/member_0000000342 waiting for the next leader election, current voting: [member_0000000342, member_0000000341, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:165)
web489 ~ # docker logs 5a4ef2701bae 2>/dev/null | fgrep leader | fgrep -v Proxying
[2015-01-23 09:12:21,597] INFO Offering leadership (mesosphere.marathon.MarathonSchedulerService:304)
[2015-01-23 09:12:21,597] INFO Using HA and therefore offering leadership (mesosphere.marathon.MarathonSchedulerService:311)
[2015-01-23 09:12:21,677] INFO Candidate /marathon/leader/member_0000000340 waiting for the next leader election, current voting: [member_0000000339, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:165)
[2015-01-23 09:12:44,549] INFO Candidate /marathon/leader/member_0000000340 waiting for the next leader election, current voting: [member_0000000339, member_0000000341, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:165)
[2015-01-23 10:32:42,002] INFO Candidate /marathon/leader/member_0000000340 is now leader of group: [member_0000000341, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:152)
[2015-01-23 10:32:42,421] INFO Elect leadership (mesosphere.marathon.MarathonSchedulerService:284)
web490 ~ # docker logs a221db669a36 2>/dev/null | fgrep leader | fgrep -v Proxying
[2015-01-23 09:12:44,538] INFO Offering leadership (mesosphere.marathon.MarathonSchedulerService:304)
[2015-01-23 09:12:44,538] INFO Using HA and therefore offering leadership (mesosphere.marathon.MarathonSchedulerService:311)
[2015-01-23 09:12:44,563] INFO Candidate /marathon/leader/member_0000000341 waiting for the next leader election, current voting: [member_0000000339, member_0000000341, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:165)
[2015-01-23 10:32:42,014] INFO Candidate /marathon/leader/member_0000000341 waiting for the next leader election, current voting: [member_0000000341, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:165)
[2015-01-23 10:32:43,777] INFO Candidate /marathon/leader/member_0000000341 waiting for the next leader election, current voting: [member_0000000342, member_0000000341, member_0000000340] (com.twitter.common.zookeeper.CandidateImpl:165)

@drexin
Copy link
Contributor

drexin commented Jan 23, 2015

Is this on 0.7.6 or 0.8.0-RC1?

@bobrik
Copy link
Contributor Author

bobrik commented Jan 23, 2015

This is master, f7f9a00 as HEAD.

@bobrik
Copy link
Contributor Author

bobrik commented Jan 23, 2015

Same app, last task:

screen shot 2015-01-23 at 14 35 17

First task:

screen shot 2015-01-23 at 14 35 29

@drexin
Copy link
Contributor

drexin commented Jan 23, 2015

Okay, I have been able to reproduce this.

@drexin drexin added this to the 0.8.1 milestone Jan 23, 2015
@bobrik
Copy link
Contributor Author

bobrik commented Jan 26, 2015

Is it the same issue that prevents health checks from appearing for tasks after scaling?

I'm seeing this:

screen shot 2015-01-26 at 23 04 40

@sttts
Copy link
Contributor

sttts commented Jan 26, 2015

@bobrik will those turn green eventually? Or do they stay like that forever?

@bobrik
Copy link
Contributor Author

bobrik commented Jan 26, 2015

It turned all green when I checked in an hour.

@sttts
Copy link
Contributor

sttts commented Jan 26, 2015

In an hour? that's long :) With the usual health check intervals?

@bobrik
Copy link
Contributor Author

bobrik commented Jan 26, 2015

It might've happened faster, I just checked in an hour. I can probably check logs for exact times tomorrow.

Health checks for this app happen every 2 seconds, deployment is very fast too: 20mb docker container.

@sttts
Copy link
Contributor

sttts commented Jan 26, 2015

Don't worry. Will try to reproduce this.

@sttts
Copy link
Contributor

sttts commented Jan 26, 2015

@bobrik Which kind of health check was it? HTTP? Command? Can you reproduce it with a mesos containerizer?

@sttts
Copy link
Contributor

sttts commented Jan 26, 2015

Tried HTTP and COMMAND, cannot reproduce.

@bobrik
Copy link
Contributor Author

bobrik commented Jan 27, 2015

Reproduced that with initial deployment. Health checks are TCP.

No health checks performed:

{"log":"[2015-01-27 07:17:44,329] INFO 192.168.1.233 -  -  [27/Jan/2015:07:17:44 +0000] \"GET /v2/apps/topface/test/wtf HTTP/1.1\" 200 1425 \"-\" \"curl/7.26.0\" (mesosphere.chaos.http.ChaosRequestLog:15)\n","stream":"stdout","time":"2015-01-27T07:17:44.329328196Z"}
{
  "app": {
    "id": "/topface/test/wtf",
    "cmd": "/app -listen 0.0.0.0:$PORT",
    "args": null,
    "user": null,
    "env": {},
    "instances": 2,
    "cpus": 0.2,
    "mem": 64,
    "disk": 0,
    "executor": "",
    "constraints": [
      [
        "hostname",
        "UNIQUE"
      ]
    ],
    "uris": [],
    "storeUrls": [],
    "ports": [
      18001
    ],
    "requirePorts": false,
    "backoffSeconds": 5,
    "backoffFactor": 1,
    "maxLaunchDelaySeconds": 3600,
    "container": {
      "type": "DOCKER",
      "volumes": [
        {
          "containerPath": "/etc/ssl/certs",
          "hostPath": "/etc/ssl/certs",
          "mode": "RO"
        }
      ],
      "docker": {
        "image": "docker.core.tf/scruffy:2015-01-18.1",
        "privileged": false,
        "parameters": []
      }
    },
    "healthChecks": [
      {
        "path": "/",
        "protocol": "TCP",
        "portIndex": 0,
        "gracePeriodSeconds": 15,
        "intervalSeconds": 2,
        "timeoutSeconds": 5,
        "maxConsecutiveFailures": 3
      }
    ],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 1,
      "maximumOverCapacity": 1
    },
    "labels": {},
    "version": "2015-01-27T07:13:40.196Z",
    "tasksStaged": 0,
    "tasksRunning": 2,
    "tasksHealthy": 0,
    "tasksUnhealthy": 0,
    "deployments": [],
    "tasks": [
      {
        "id": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799",
        "host": "web323",
        "ports": [
          31650
        ],
        "startedAt": "2015-01-27T07:13:45.488Z",
        "stagedAt": "2015-01-27T07:13:41.684Z",
        "version": "2015-01-27T07:13:40.196Z",
        "appId": "/topface/test/wtf"
      },
      {
        "id": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799",
        "host": "web491",
        "ports": [
          31778
        ],
        "startedAt": "2015-01-27T07:13:45.516Z",
        "stagedAt": "2015-01-27T07:13:41.618Z",
        "version": "2015-01-27T07:13:40.196Z",
        "appId": "/topface/test/wtf"
      }
    ],
    "lastTaskFailure": null
  }
}

Here they are:

{"log":"[2015-01-27 07:18:10,393] INFO 192.168.1.233 -  -  [27/Jan/2015:07:18:10 +0000] \"GET /v2/apps/topface/test/wtf HTTP/1.1\" 200 1883 \"-\" \"curl/7.26.0\" (mesosphere.chaos.http.ChaosRequestLog:15)\n","stream":"stdout","time":"2015-01-27T07:18:10.39404871Z"}
{
  "app": {
    "id": "/topface/test/wtf",
    "cmd": "/app -listen 0.0.0.0:$PORT",
    "args": null,
    "user": null,
    "env": {},
    "instances": 2,
    "cpus": 0.2,
    "mem": 64,
    "disk": 0,
    "executor": "",
    "constraints": [
      [
        "hostname",
        "UNIQUE"
      ]
    ],
    "uris": [],
    "storeUrls": [],
    "ports": [
      18001
    ],
    "requirePorts": false,
    "backoffSeconds": 5,
    "backoffFactor": 1,
    "maxLaunchDelaySeconds": 3600,
    "container": {
      "type": "DOCKER",
      "volumes": [
        {
          "containerPath": "/etc/ssl/certs",
          "hostPath": "/etc/ssl/certs",
          "mode": "RO"
        }
      ],
      "docker": {
        "image": "docker.core.tf/scruffy:2015-01-18.1",
        "privileged": false,
        "parameters": []
      }
    },
    "healthChecks": [
      {
        "path": "/",
        "protocol": "TCP",
        "portIndex": 0,
        "gracePeriodSeconds": 15,
        "intervalSeconds": 2,
        "timeoutSeconds": 5,
        "maxConsecutiveFailures": 3
      }
    ],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 1,
      "maximumOverCapacity": 1
    },
    "labels": {},
    "version": "2015-01-27T07:13:40.196Z",
    "tasksStaged": 0,
    "tasksRunning": 2,
    "tasksHealthy": 2,
    "tasksUnhealthy": 0,
    "deployments": [],
    "tasks": [
      {
        "id": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799",
        "host": "web323",
        "ports": [
          31650
        ],
        "startedAt": "2015-01-27T07:13:45.488Z",
        "stagedAt": "2015-01-27T07:13:41.684Z",
        "version": "2015-01-27T07:13:40.196Z",
        "appId": "/topface/test/wtf",
        "healthCheckResults": [
          {
            "alive": true,
            "consecutiveFailures": 0,
            "firstSuccess": "2015-01-27T07:17:53.424Z",
            "lastFailure": null,
            "lastSuccess": "2015-01-27T07:18:09.585Z",
            "taskId": "topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799"
          }
        ]
      },
      {
        "id": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799",
        "host": "web491",
        "ports": [
          31778
        ],
        "startedAt": "2015-01-27T07:13:45.516Z",
        "stagedAt": "2015-01-27T07:13:41.618Z",
        "version": "2015-01-27T07:13:40.196Z",
        "appId": "/topface/test/wtf",
        "healthCheckResults": [
          {
            "alive": true,
            "consecutiveFailures": 0,
            "firstSuccess": "2015-01-27T07:17:53.424Z",
            "lastFailure": null,
            "lastSuccess": "2015-01-27T07:18:09.585Z",
            "taskId": "topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799"
          }
        ]
      }
    ],
    "lastTaskFailure": null
  }
}

Deployment was finished a while ago, but health checks didn't appear.

Health checks in marathon log started much sooner.

{"log":"[INFO] [01/27/2015 07:13:46.912] [marathon-akka.actor.default-dispatcher-24] [akka://marathon/user/$k] Received health result: [Healthy(topface_test_wtf.05db9fb8-a5f4-11e4-9eee-56847afe9799,2015-01-27T07:13:40.196Z,2015-01-27T07:13:46.912Z)]\n","stream":"stdout","time":"2015-01-27T07:13:46.912698279Z"}
{"log":"[INFO] [01/27/2015 07:13:46.912] [marathon-akka.actor.default-dispatcher-24] [akka://marathon/user/$k] Received health result: [Healthy(topface_test_wtf.05cf6ab7-a5f4-11e4-9eee-56847afe9799,2015-01-27T07:13:40.196Z,2015-01-27T07:13:46.912Z)]\n","stream":"stdout","time":"2015-01-27T07:13:46.912767613Z"}

I'm using master with your PR merged for bars in UI.

@c089
Copy link
Contributor

c089 commented Jan 27, 2015

I'm seeing similar issues after scaling up: Immediately after starting a new task, the JSON returned by the API does not include a healthCheckResults property at all. Then, after about 30s I can see the first health check appear in the log of the started service, and the marathon API returns a singleton array with the null value (i.e. [ null ]). Then nothing happens for a while (except for new health checks every 30s) and it sometimes takes several minutes for the actual health check results to be returned by the API...

@sttts
Copy link
Contributor

sttts commented Jan 27, 2015

@c089 How does your app json look like?

@sttts
Copy link
Contributor

sttts commented Jan 27, 2015

I have replicated @bobrik's app, docker as containerizer, TCP health check with the same values. But I neither see the delayed green in the health-bar, nor wrong first success values. @bobrik Would you mind to upload complete logs as a gist?

@c089
Copy link
Contributor

c089 commented Jan 27, 2015

The complete one reported by the API or the one I used to create the app with?

@sttts
Copy link
Contributor

sttts commented Jan 27, 2015

The one to create the app.

@c089
Copy link
Contributor

c089 commented Jan 27, 2015

Mostly using defaults stripped it down to the minimum I needed to deploy and I'm also using docker containers. Generated from this JS object:

{
    id: 'hotel-index-mvp',
    container: {
        type: 'DOCKER',
        docker: {
            image: imageName,
            network: 'BRIDGE',
            portMappings: [ { containerPort: 3000 } ]
        }
    },
    env: {},
    healthChecks: [ { path: '/hi/_health' } ]
};

@bobrik
Copy link
Contributor Author

bobrik commented Jan 27, 2015

Weird, this happens only in production cluster. I tried the same marathon image with different data path in zk and health checks appear immediately.

Here are logs from prodution master (grepped for wtf): https://gist.github.com/bobrik/dd2fdbf49284dd189e19

Interesting stuff starts at 2015-01-27 09:33:17,598. App json included into gist.

@sttts
Copy link
Contributor

sttts commented Jan 27, 2015

The interesting line is

{"log":"[2015-01-27 09:33:17,609] INFO Adding health check for app

Why does Marathon start another health check actor? It was started at the top already.

@drexin drexin closed this as completed Jan 30, 2015
@bobrik
Copy link
Contributor Author

bobrik commented Feb 4, 2015

I've seen this without any failover, even on 1-node cluster.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 4, 2015

And I'm still seeing this on 3-node production cluster. I just updated to current master and the issue is still there. I tried deploying an app with different paths: /topface/prod-test/app, /topface/hehehe/app, /hell/no — still having the same issue.

I tried different app on /totally/different/path and it is broken too:

screen shot 2015-02-04 at 13 17 49

[2015-02-04 10:14:15,562] INFO Starting app /totally/different/path (mesosphere.marathon.SchedulerActions:363)
[2015-02-04 10:14:15,566] INFO Already running 0 instances of /totally/different/path. Not scaling. (mesosphere.marathon.SchedulerActions:512)
[2015-02-04 10:14:16,278] INFO Received status update for task totally_different_path.92b2be04-ac56-11e4-bc81-56847afe9799: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
[INFO] [02/04/2015 10:14:17.565] [marathon-akka.actor.default-dispatcher-22] [akka://marathon/user/$V] Received health result: [Healthy(totally_different_path.92b2be04-ac56-11e4-bc81-56847afe9799,2015-02-04T10:14:15.433Z,2015-02-04T10:14:17.565Z)]
[INFO] [02/04/2015 10:14:17.565] [marathon-akka.actor.default-dispatcher-9] [akka://marathon/user/MarathonScheduler/$a/UpgradeManager/64b174e1-d279-453b-920b-1e7559bf4c1a/$b] totally_different_path.92b2be04-ac56-11e4-bc81-56847afe9799 is now healthy
[2015-02-04 10:14:18,926] INFO Received status update for task totally_different_path.92b296f3-ac56-11e4-bc81-56847afe9799: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler:148)
[INFO] [02/04/2015 10:14:19.584] [marathon-akka.actor.default-dispatcher-4] [akka://marathon/user/$V] Received health result: [Healthy(totally_different_path.92b2be04-ac56-11e4-bc81-56847afe9799,2015-02-04T10:14:15.433Z,2015-02-04T10:14:19.584Z)]
[INFO] [02/04/2015 10:14:19.584] [marathon-akka.actor.default-dispatcher-3] [akka://marathon/user/$V] Received health result: [Healthy(totally_different_path.92b296f3-ac56-11e4-bc81-56847afe9799,2015-02-04T10:14:15.433Z,2015-02-04T10:14:19.584Z)]
[INFO] [02/04/2015 10:14:19.584] [marathon-akka.actor.default-dispatcher-5] [akka://marathon/user/MarathonScheduler/$a/UpgradeManager/64b174e1-d279-453b-920b-1e7559bf4c1a/$b] totally_different_path.92b296f3-ac56-11e4-bc81-56847afe9799 is now healthy
[INFO] [02/04/2015 10:14:19.584] [marathon-akka.actor.default-dispatcher-5] [akka://marathon/user/MarathonScheduler/$a/UpgradeManager/64b174e1-d279-453b-920b-1e7559bf4c1a/$b] Successfully started 2 instances of /totally/different/path

Looks like state of my cluster got broken somehow. I think it's better to figure out what went wrong instead of just nuking current state with hope that it won't happen again.

I can add more logging to provide you with more info. Getting zk dump is also possible.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 11, 2015

I launched marathon with new zk path, deployed everything except for my main app and health checks are delayed (deployment finished at ~13:20:28):

screen shot 2015-02-11 at 12 24 36

@bobrik
Copy link
Contributor Author

bobrik commented Feb 11, 2015

Well, at least deployment waits for new tasks to become healthy. This issue should be open anyway.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

Interesting observation: when firstSuccess appears in API and in UI, health checks are actually doubled. Time in UI is 2/18/2015, 4:15:37 PM, relevant piece of log:

192.168.1.234 - - [18/Feb/2015:16:15:14 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:19 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:24 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:29 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:34 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:37 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:39 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:42 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:44 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:47 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"

More context:

192.168.1.234 - - [18/Feb/2015:16:11:05 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:08 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:10 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:13 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:13 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:18 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:23 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:28 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:33 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:38 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:43 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:48 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:53 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:11:58 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:03 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:08 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:13 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:18 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:23 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:28 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:33 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:38 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:43 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:48 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:53 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:12:58 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:03 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:08 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:14 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:19 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:24 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:29 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:34 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:39 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:44 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:49 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:54 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:13:59 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:04 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:09 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:14 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:19 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:24 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:29 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:34 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:39 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:44 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:49 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:54 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:14:59 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:04 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:09 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:14 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:19 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:24 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:29 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:34 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:37 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:39 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:42 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:44 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:47 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:49 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:52 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:54 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:57 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:15:59 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:02 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:04 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:07 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:09 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:12 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:14 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:17 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:19 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:22 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:24 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:27 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:29 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:32 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:34 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:37 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:39 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:42 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:44 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:47 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:49 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:52 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:54 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:57 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:16:59 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:02 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:04 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:07 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:09 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:12 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:14 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:17 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:19 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:22 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:25 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:28 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:30 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:33 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:35 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:38 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:40 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:43 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:45 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:48 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:50 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:53 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:55 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:17:58 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:18:00 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"
192.168.1.234 - - [18/Feb/2015:16:18:03 +0300] "GET /_health HTTP/1.1" 200 13 "-" "spray-can/1.3.2"

@drexin, @ConnorDoyle please reopen the issue.

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

What's the marathon log at the same moment?

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

https://gist.github.com/bobrik/dcca1f2c375f7d1399ea I removed http logs and unrelevant apps.

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

Look at these lines:

[2015-02-18 13:15:32,556] INFO Removing health check for app [/topface/prod-test/app] and version [2015-02-18T12:59:30.198Z]: [HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3)] (mesosphere.marathon.health.MarathonHealthCheckManager:91)
[2015-02-18 13:15:32,557] INFO Removing health check for app [/topface/prod-test/app] and version [2015-02-18T12:59:30.198Z]: [HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3)] (mesosphere.marathon.health.MarathonHealthCheckManager:91)

This means that both ActiveHealthChecks are registered in the appHealthChecks map. Very strange.

In order to get a better understanding, could you try to reproduce the problem with this patch? https://gist.github.com/anonymous/f67894498e5df2cababd

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

Sure, here is the log: https://gist.github.com/bobrik/fbf5b36309e5899530df

Lines are filtered like before.

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

Look at the appHealthChecks map after the second health check was started:

Map(
  AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) ->
    Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,
           10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), 
  AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> 
    Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,
           10 seconds,3),Actor[akka://marathon/user/$k#186143323]))) 

We have visually the same key. If we assume that the Scala map is not broken, the equality of AppVersion seems to give something we don't expect.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

Huh, I deployed several times and health checks accumulated on each deploy:

[2015-02-18 17:38:33,656] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 17:38:33,660] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)
2015-02-18 18:01:09,837:1(0x7f80f3fff700):ZOO_WARN@zookeeper_interest@1557: Exceeded deadline by 20ms
[2015-02-18 18:08:26,157] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)
[2015-02-18 18:08:33,780] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)
[2015-02-18 18:11:32,628] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:11:32.442Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$q#1857576886])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)
[2015-02-18 18:11:51,836] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:11:32.442Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$q#1857576886])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)
[2015-02-18 18:12:24,372] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:11:32.442Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$q#1857576886])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092])), AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)
[2015-02-18 18:13:33,687] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$l#1882749795])), AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092])), AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 18:13:33,688] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T17:35:48.995Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,5 seconds,10 seconds,3),Actor[akka://marathon/user/$k#186143323])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092])), AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 18:13:33,689] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$o#1677380539])), AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092])), AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 18:13:33,690] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:08:25.925Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$n#-10918092])), AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 18:13:33,691] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091])), AppVersion(/topface/prod-test/app,2015-02-18T18:11:51.619Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$s#1957219837]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 18:13:33,692] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091]))) (mesosphere.marathon.health.MarathonHealthCheckManager:109)
[2015-02-18 18:13:33,701] INFO Existing health checks for app [/topface/prod-test/app] after updating appHealthChecks: Map(AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$v#1772774775])), AppVersion(/topface/prod-test/app,2015-02-18T18:12:24.171Z) -> Set(ActiveHealthCheck(HealthCheck(Some(/_health),HTTP,0,None,5 seconds,8 seconds,10 seconds,3),Actor[akka://marathon/user/$u#1146763091]))) (mesosphere.marathon.health.MarathonHealthCheckManager:83)

sttts added a commit to sttts/marathon that referenced this issue Feb 18, 2015
Timestamp's hashCode was based on the parameter dataTime which might have
a timezone != UTC. The consequence:

  val t = Timestamp.now
  t.hashCode != Timestamp(t.toString).hashCode

This patch overrides hashCode to use the normalize time variable of the
Timestamp case class.

This probably fixes d2iq-archive#1082
sttts added a commit to sttts/marathon that referenced this issue Feb 18, 2015
Timestamp's hashCode was based on the parameter dataTime which might have
a timezone != UTC. The consequence:

  val t = Timestamp.now
  t.hashCode != Timestamp(t.toString).hashCode

This patch overrides hashCode to use the UTC normalized time variable of the
Timestamp case class.

This probably fixes d2iq-archive#1082
@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

@bobrik Please test the upper patch. I think this fixes your issue.

sttts added a commit to sttts/marathon that referenced this issue Feb 18, 2015
Timestamp's hashCode was based on the parameter dataTime which might have
a timezone != UTC. The consequence:

  val t = Timestamp.now
  t.hashCode != Timestamp(t.toString).hashCode

This patch overrides hashCode to use the UTC normalized time variable of the
Timestamp case class.

This probably fixes d2iq-archive#1082
@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

I'm running marathon in UTC, but I'll give it a try.

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

I think it's about the chronology parameter in general inside of DateTime. There might be other chronologies which are actually equivalent to UTC, but still giving a different hashCode. Let's see, only a theory for now.

In my testing here, I did this:

val t = Timestamp.now
t.hashCode
Timestamp(t.toString).hashCode

The two hash codes differed.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

OMG IT WORKED! Beers are on me next time I meet you in person 🍺

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

:-)

@ConnorDoyle
Copy link
Contributor

Nice find @sttts

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

Wondering how many other strange behaviors were triggered by this bug.... a map which isn't a map.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

I wonder why it worked in the first place, with 1-2 apps it works pretty well.

@bobrik
Copy link
Contributor Author

bobrik commented Feb 18, 2015

Is is because appHealthChecks hash rebuilds (more buckets, app versions in different buckets) when more apps are introduced?

@sttts
Copy link
Contributor

sttts commented Feb 18, 2015

Yes, can be. I guess Scala's Map implementation uses equal in the same bucket. And equal was correct. So Timestamps sometimes were identified, sometimes they were not.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants