This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Health status not reflected in REST API until some time after start #1270

Closed
sdwr98 opened this issue Mar 4, 2015 · 4 comments

Comments


sdwr98 commented Mar 4, 2015

Hey all,

I'm seeing an issue on my Marathon 0.8.0/Mesos 0.21.1 cluster during deployment of a new version of a running application (or, sometimes, during deployment of a new application).

When I do the deployment, I see that Marathon waits until the new task becomes healthy before killing the old task, but there is a period of time when the new task reports that it has no successful health checks even though it's the only one running. I'll give an example timeline below:

08:45: Deployment of new application. The "tasks" section of /v2/apps/ returns this:

"tasks": [
  {
    "id": "app_identityportal.3e45b599-c1fd-11e4-a406-06bbba6a4180",
    "host": "mesosnode2-aws-west.motus.com",
    "ports": [
      31912
    ],
    "startedAt": "2015-03-03T23:30:24.565Z",
    "stagedAt": "2015-03-03T23:30:14.360Z",
    "version": "2015-03-03T23:30:10.622Z",
    "appId": "/app/identityportal",
    "healthCheckResults": [
      {
        "alive": true,
        "consecutiveFailures": 0,
        "firstSuccess": "2015-03-03T23:34:10.437Z",
        "lastFailure": "2015-03-04T03:07:22.284Z",
        "lastSuccess": "2015-03-04T13:45:27.279Z",
        "taskId": "app_identityportal.3e45b599-c1fd-11e4-a406-06bbba6a4180"
      }
    ]
  },
  {
    "id": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180",
    "host": "mesosnode5-aws-west.motus.com",
    "ports": [
      31000
    ],
    "startedAt": null,
    "stagedAt": "2015-03-04T13:45:22.330Z",
    "version": "2015-03-04T13:45:22.121Z",
    "appId": "/app/identityportal"
  }
],

As you can see, there's a new task on mesosnode5 that doesn't have any health check results yet.

08:48:23 - App becomes healthy according to marathon logs:

08:47:42.372 host=mesosmaster1-aws-west tag=marathon[22493]: [INFO] [03/04/2015 05:47:42.274] [marathon-akka.actor.default-dispatcher-112] [akka://marathon/user/$Ab] Received health result: [Unhealthy(app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180,2015-03-04T13:45:22.121Z,AskTimeoutException: Ask timed out on [Actor[akka://marathon/user/IO-HTTP#-23040394]] after [20000 ms],2015-03-04T13:47:42.274Z)] Context
08:48:23.118 host=mesosmaster1-aws-west tag=marathon[22493]: [INFO] [03/04/2015 05:48:23.021] [marathon-akka.actor.default-dispatcher-82] [akka://marathon/user/$Ab] Received health result: [Healthy(app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180,2015-03-04T13:45:22.121Z,2015-03-04T13:48:23.021Z)] Context
08:48:23.288 host=mesosmaster1-aws-west tag=marathon[22493]: [INFO] [03/04/2015 05:48:23.021] [marathon-akka.actor.default-dispatcher-82] [akka://marathon/user/MarathonScheduler/$a/UpgradeManager/f569b469-c11b-46a6-a60e-22d7015df27f/$a] Killing old task app_identityportal.3e45b599-c1fd-11e4-a406-06bbba6a4180 because app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180 became reachable Context

However, at 08:48:35, a call to /v2/apps/app/identityportal returns this for the "tasks" section:

"tasks": [
  {
    "id": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180",
    "host": "mesosnode5-aws-west.motus.com",
    "ports": [
      31000
    ],
    "startedAt": "2015-03-04T13:46:50.550Z",
    "stagedAt": "2015-03-04T13:45:22.330Z",
    "version": "2015-03-04T13:45:22.121Z",
    "appId": "/app/identityportal",
    "healthCheckResults": [
      null
    ]
  }
],

As you can see, there is no information in the health check results. It's not until 08:49:16 that the API call gets health check information:

"tasks": [
  {
    "id": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180",
    "host": "mesosnode5-aws-west.motus.com",
    "ports": [
      31000
    ],
    "startedAt": "2015-03-04T13:46:50.550Z",
    "stagedAt": "2015-03-04T13:45:22.330Z",
    "version": "2015-03-04T13:45:22.121Z",
    "appId": "/app/identityportal",
    "healthCheckResults": [
      {
        "alive": true,
        "consecutiveFailures": 0,
        "firstSuccess": "2015-03-04T13:49:10.412Z",
        "lastFailure": null,
        "lastSuccess": "2015-03-04T13:49:10.412Z",
        "taskId": "app_identityportal.b4349fc1-c274-11e4-a406-06bbba6a4180"
      }
    ]
  }
],

This is causing problems because we have a process that synchronizes our load balancer with Marathon, and that process thinks that there are no healthy tasks available for a period of a couple of minutes. I have not yet tried 0.8.1, but I didn't immediately see any issues in the fix list that would apply to this situation.
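As a workaround for the sync process described above, the load-balancer updater could require at least one non-null, alive entry in `healthCheckResults` before trusting a task. This is a minimal sketch, not Marathon's own code: the helper name and sample data are illustrative, while the field names (`healthCheckResults`, `alive`) match the `/v2/apps` payloads shown earlier.

```python
def healthy_tasks(tasks):
    """Return only tasks that have at least one alive health check result.

    Marathon can return "healthCheckResults": [null] (or omit the key
    entirely) in the window before results arrive, so both cases are
    treated as unhealthy here.
    """
    healthy = []
    for task in tasks:
        results = task.get("healthCheckResults") or []
        if any(r and r.get("alive") for r in results):
            healthy.append(task)
    return healthy

# Sample data mirroring the three states seen in this issue:
tasks = [
    {"id": "task-a", "healthCheckResults": [{"alive": True}]},  # healthy
    {"id": "task-b", "healthCheckResults": [None]},             # the 08:48:35 case
    {"id": "task-c"},                                           # no results yet
]

print([t["id"] for t in healthy_tasks(tasks)])  # → ['task-a']
```

Note this only masks the reporting gap: during the window described above, a task that is actually healthy would still be excluded from the load balancer.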

@sdwr98 sdwr98 changed the title Health status not reflected in REST API until reconciliation Health status not reflected in REST API until some time after start Mar 4, 2015

sdwr98 commented Mar 9, 2015

Just an update to this issue - reducing the reconciliation_interval on the marathon masters reduced the duration of time that the tasks were without health check results, but did not completely eliminate the problem.
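For reference, the reconciliation interval mentioned above is a Marathon master launch flag. A sketch of how it might be lowered follows; the ZooKeeper URLs are placeholders, the value is in milliseconds, and the exact flag name and default should be verified against your Marathon version's `--help` output:

```shell
# Restart the Marathon master with a shorter reconciliation interval
# (60 s here instead of the default). Placeholder hosts/paths.
./bin/start --master zk://zk1:2181/mesos \
            --zk zk://zk1:2181/marathon \
            --reconciliation_interval 60000
```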


sttts commented Mar 9, 2015

Hi Scott,

in 0.8.1 we fixed a bug that sounds very much like what you describe: health check results are lost, even though the health checks are running, until reconciliation starts a second copy of the health check; after that, the health check results are processed correctly.

Please take a look at the latest 0.8.1 RC build. I am pretty confident that the problem will disappear with the new release.

Regards,
Stefan



sttts commented Mar 9, 2015

Here is the old bug: #1082


sdwr98 commented Mar 9, 2015

Thanks @sttts - I've deployed 0.8.1 RC2 and confirmed that it does fix the issue. Thank you!
