This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Health status not reflected in REST API until some time after start #1270

Closed
sdwr98 opened this issue Mar 4, 2015 · 4 comments

Comments


sdwr98 commented Mar 4, 2015

Hey all,

I'm seeing an issue on my Marathon 0.8.0/Mesos 0.21.1 cluster during deployment of a new version of a running application (or, sometimes, during deployment of a new application).

When I do the deployment, I see that Marathon waits until the new task becomes healthy before killing the old task, but there is a period of time when the new task reports that it has no successful health checks even though it's the only one running. I'll give an example timeline below:

08:45: Deployment of new application. The "tasks" section of /v2/apps/ returns this:

"tasks": [
  {
    "id": "app_identityportal.3e45b599-c1fd-11e4-a406-06bbba6a4180",
    "host": "mesosnode2-aws-west.motus.com",
    "ports": [
      31912
    ],
    "startedAt": "2015-03-03T23:30:24.565Z",
    "stagedAt": "2015-03-03T23:30:14.360Z",
    "version": "2015-03-03T23:30:10.622Z",
    "appId": "/app/identityportal",
    "healthCheckResults": [
      {
        "alive": true,
        "consecutiveFailures": 0,
        "firstSuccess": "2015-03-03T23:34:10.437Z",
        "lastFailure": "2015-03-04T03:07:22.284Z",
        "lastSuccess": "2015-03-04T13:45:27.279Z",
        "taskId": "app_identityportal.3e45b599-c1fd-11e4-a406-06bbba6a4180"
      }
    ]
  },
  {
    "id": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180",
    "host": "mesosnode5-aws-west.motus.com",
    "ports": [
      31000
    ],
    "startedAt": null,
    "stagedAt": "2015-03-04T13:45:22.330Z",
    "version": "2015-03-04T13:45:22.121Z",
    "appId": "/app/identityportal"
  }
],

As you can see, there's a new task on mesosnode5 that doesn't have any health check results yet.

08:48:23 - App becomes healthy according to marathon logs:

08:47:42.372 host=mesosmaster1-aws-west tag=marathon[22493]: [INFO] [03/04/2015 05:47:42.274] [marathon-akka.actor.default-dispatcher-112] [akka://marathon/user/$Ab] Received health result: [Unhealthy(app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180,2015-03-04T13:45:22.121Z,AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#-23040394]] after [20000 ms],2015-03-04T13:47:42.274Z)] Context
08:48:23.118 host=mesosmaster1-aws-west tag=marathon[22493]: [INFO] [03/04/2015 05:48:23.021] [marathon-akka.actor.default-dispatcher-82] [akka://marathon/user/$Ab] Received health result: [Healthy(app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180,2015-03-04T13:45:22.121Z,2015-03-04T13:48:23.021Z)] Context
08:48:23.288 host=mesosmaster1-aws-west tag=marathon[22493]: [INFO] [03/04/2015 05:48:23.021] [marathon-akka.actor.default-dispatcher-82] [akka://marathon/user/MarathonScheduler/$a/UpgradeManager/f569b469-c11b-46a6-a60e-22d7015df27f/$a] Killing old task app_identityportal.3e45b599-c1fd-11e4-a406-06bbba6a4180 because app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180 became reachable Context

However, at 08:48:35, a call to /v2/apps/app/identityportal returns this for the "tasks" section:

"tasks": [
  {
    "id": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180",
    "host": "mesosnode5-aws-west.motus.com",
    "ports": [
      31000
    ],
    "startedAt": "2015-03-04T13:46:50.550Z",
    "stagedAt": "2015-03-04T13:45:22.330Z",
    "version": "2015-03-04T13:45:22.121Z",
    "appId": "/app/identityportal",
    "healthCheckResults": [
      null
    ]
  }
],

As you can see, there is no information in the health check results. It's not until 08:49:16 that the API call gets health check information:

"tasks": [
  {
    "id": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180",
    "host": "mesosnode5-aws-west.motus.com",
    "ports": [
      31000
    ],
    "startedAt": "2015-03-04T13:46:50.550Z",
    "stagedAt": "2015-03-04T13:45:22.330Z",
    "version": "2015-03-04T13:45:22.121Z",
    "appId": "/app/identityportal",
    "healthCheckResults": [
      {
        "alive": true,
        "consecutiveFailures": 0,
        "firstSuccess": "2015-03-04T13:49:10.412Z",
        "lastFailure": null,
        "lastSuccess": "2015-03-04T13:49:10.412Z",
        "taskId": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180"
      }
    ]
  }
],

This is causing problems because we have a process that synchronizes our load balancer with Marathon, and that process thinks that there are no healthy tasks available for a period of a couple of minutes. I have not yet tried 0.8.1, but I didn't immediately see any issues in the fix list that would apply to this situation.
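As a workaround for the sync process described above, the load-balancer updater could require at least one non-null, alive entry in `healthCheckResults` before trusting a task. This is a minimal sketch, not Marathon's own code: the helper name and sample data are illustrative, while the field names (`healthCheckResults`, `alive`) match the `/v2/apps` payloads shown earlier.

```python
def healthy_tasks(tasks):
    """Return only tasks that have at least one alive health check result.

    Marathon can return "healthCheckResults": [null] (or omit the key
    entirely) in the window before results arrive, so both cases are
    treated as unhealthy here.
    """
    healthy = []
    for task in tasks:
        results = task.get("healthCheckResults") or []
        if any(r and r.get("alive") for r in results):
            healthy.append(task)
    return healthy

# Sample data mirroring the three states seen in this issue:
tasks = [
    {"id": "task-a", "healthCheckResults": [{"alive": True}]},  # healthy
    {"id": "task-b", "healthCheckResults": [None]},             # the 08:48:35 case
    {"id": "task-c"},                                           # no results yet
]

print([t["id"] for t in healthy_tasks(tasks)])  # → ['task-a']
```

Note this only masks the reporting gap: during the window described above, a task that is actually healthy would still be excluded from the load balancer.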

@sdwr98 sdwr98 changed the title Health status not reflected in REST API until reconciliation Health status not reflected in REST API until some time after start Mar 4, 2015

sdwr98 commented Mar 9, 2015

Just an update to this issue - reducing the reconciliation_interval on the marathon masters reduced the duration of time that the tasks were without health check results, but did not completely eliminate the problem.
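For reference, the reconciliation interval mentioned above is a Marathon master launch flag. A sketch of how it might be lowered follows; the ZooKeeper URLs are placeholders, the value is in milliseconds, and the exact flag name and default should be verified against your Marathon version's `--help` output:

```shell
# Restart the Marathon master with a shorter reconciliation interval
# (60 s here instead of the default). Placeholder hosts/paths.
./bin/start --master zk://zk1:2181/mesos \
            --zk zk://zk1:2181/marathon \
            --reconciliation_interval 60000
```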


sttts commented Mar 9, 2015

Hi Scott,

in 0.8.1 we fixed a bug that sounds very much like what you describe: health check results are lost, even though the health checks are running, until reconciliation starts a second copy of the health check; after that, the health check results are processed correctly.

Please take a look at the latest 0.8.1 RC build. I am pretty confident that the problem will disappear with the new release.

Regards,
Stefan



sttts commented Mar 9, 2015

Here is the old bug: #1082


sdwr98 commented Mar 9, 2015

Thanks @sttts - I've deployed 0.8.1 RC2 and confirmed that it does fix the issue. Thank you!
