[feature] gracefully handle corrupt startup state #1367

camerondavison · 2016-06-28T20:56:44Z

As is mentioned in #1170 , if any of the state files are corrupted when a nomad client start up then it will never be able to start.

I think that on start up it should be considered (best effort) to try and recover the state of the machine but in the event that state files are corrupted maybe it would be best to make sure and shutdown any tasks that are still running, and delete all of the allocation state information on disk. This way it will still start up, but let the scheduler figure out a new home for the task.

ghost · 2016-06-28T21:26:44Z

This would be neat! I experience this fairly occasionally when docker hangs and I have to force reboot an instance.

I've built a temporary work around where I've written health checks for backends that checks the state of docker and nomad, and, if either fail, let Google Compute Engine automatically destroy and reprovision the instance.

camerondavison · 2016-06-29T00:10:59Z

I would be interested to know if you continue seeing this after 0.4.0 comes
out because of #1357

On Tue, Jun 28, 2016, 4:26 PM George Robinson notifications@github.com
wrote:

This would be neat! I experience this fairly occasionally when docker
hangs and I have to force reboot an instance.

I've built a temporary work around where I've written health checks for
backends that checks the state of docker and nomad are healthy, and, if
either fail, let Google Compute Engine automatically destroy and
reprovision the instance.

—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#1367 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAKGBYVKtnIC8qndmvslKZB4sbbsJ60Kks5qQZGZgaJpZM4JAh0G
.

flegler-zz · 2016-08-15T20:55:13Z

+1 for this feature.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json file are empty

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

capone212 · 2016-08-17T11:40:25Z

+1
Experience the same problem

camerondavison · 2016-08-17T12:06:45Z

@flegler is that usually after a hard server crash?

ghost · 2016-08-17T12:09:06Z

@a86c6f7964 I also see with kill -9. It should be simple to do atomic writes - write to a tmp file then do a rename operation - since renaming a file on Linux (and most fs implementations) is an atomic operation.

camerondavison · 2016-08-17T12:27:36Z

Ya that is what #1357 was supposed to do. It just doesn't do an fsync which is why I was asking if it was on server crash. Makes me think we missed something

flegler-zz · 2016-08-17T21:26:08Z

@a86c6f7964 I am not sure, but it is possible that the nodes were shut down without ending Nomad first.

flegler-zz · 2016-08-18T23:22:26Z

Another issue today on 0.4.0 when the client did not want to start. This time not with JSON files. Anybody knows what could be wrong here?

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 2 error(s) occurred:

* alloc_runner snapshot includes a nil allocation
* alloc_runner snapshot includes a nil context

Deleting the Nomad data directory helped again. I kept a copy in case you need some information from it.

flegler-zz · 2016-08-19T21:05:49Z

Another variation today with task runner:

/path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task

This is happening in a cloud environment where services are usually not shut down before nodes/machines are shut down.

dadgar · 2016-08-25T23:28:47Z

@flegler Yes it would be awesome if you could post those up

schmichael · 2016-08-26T00:04:35Z

For now my initial work on this issue I'm going to do the following:

Handle restoring nil contexts properly as it's not an error restore a nil context
Add some sanity checks around state saving/restoring

If we determine there are unavoidable environmental factors that could lead to state files being truncated/corrupted while at rest (e.g. unclean shutdowns and restarts), then we have a plan on how to gracefully deal with them. However, we'd like to ensure there's no bugs with persisting state before we focus on defensive mitigations that could cover up solvable bugs.

Prevent interleaving state syncs as it could conceivably lead to empty state files as per #1367

dadgar · 2016-11-09T23:32:56Z

Dropped the milestone but keeping this open in case anyone hits any issue we can track it. 0.5.0 includes some potential cases of corruption and adds sanity checks. Hopefully those were all the issues!

yangzhares · 2016-12-01T02:42:39Z

hey guys,

i also got this issue in v0.4.1, task state.json is empty.
-rw-------. 1 root root 0 Nov 28 06:47 /var/lib/nomad/client/alloc/5074d3ec-e4b8-9477-46a0-fa187300f84c/task-86a1b907d54bf7010394bf316e183e67/state.json

systemd start log:
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal systemd[1]: Started Nomad.
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal systemd[1]: Starting Nomad...
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal nomad[3624]: Loaded configuration from /etc/nomad/client.hcl
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal nomad[3624]: ==> Starting Nomad agent...
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal nomad[3624]: ==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal nomad[3624]: * failed to decode state: unexpected end of JSON input
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: Unit nomad.service entered failed state.
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: nomad.service failed.

schmichael · 2016-12-01T18:18:21Z

@yangzhares Any chance you can upgrade to 0.5.0? That's where some fixes and sanity checks were introduced to address this.

pgporada · 2016-12-16T20:55:19Z

@schmichael I am encountering something related to this on Nomad v0.5.1

Nomad starts just fine.

[root@centos-7-cl1 ~]# systemctl status nomad
● nomad.service - Nomad Agent
   Loaded: loaded (/usr/lib/systemd/system/nomad.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-12-16 20:54:42 UTC; 1s ago
 Main PID: 10955 (sh)
   Memory: 2.4M
   CGroup: /system.slice/nomad.service
           ├─10955 /bin/sh -c /sbin/nomad agent -config /etc/nomad  >> /var/log/nomad/nomad.log 2>&1
           └─10956 /sbin/nomad agent -config /etc/nomad

Dec 16 20:54:42 centos-7-cl1 systemd[1]: nomad.service holdoff time over, scheduling restart.
Dec 16 20:54:42 centos-7-cl1 systemd[1]: Started Nomad Agent.
Dec 16 20:54:42 centos-7-cl1 systemd[1]: Starting Nomad Agent...

I can schedule jobs onto my clients, however, these messages still appear on the clients.

$ cat /var/log/nomad/nomad.log
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...

If I run these commands, I can stop the errors from appearing.

systemctl stop nomad
cd /opt/nomad/client/alloc/
rm -rf *
cd /opt/nomad/alloc/
rm -rf *
systemctl restart nomad

It would be really beneficial if the logs had a timestamp.

jmvacdmon · 2017-05-02T16:03:16Z

Same in nomad 0.5.4

Loaded configuration from /etc/nomad.d/nomad.hcl, /etc/nomad.d/nomad_client.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

2 error(s) occurred:
alloc_runner snapshot includes a nil allocation
alloc_runner snapshot includes a nil alloc dir

rm -R /opt/nomad/client/alloc/ && systemctl start nomad

schmichael · 2017-07-03T20:07:34Z

I'm closing this as 0.6 substantially changes the state persistence code. Nomad now uses a single boltdb file instead of a tree of JSON files. This made atomic state changes much easier and hopefully fixed corrupt state issues (along with some other minor improvements to persistence code in 0.6).

I've also added more verbose logging when state issues are encountered in #2763

Please don't hesitate to open new issues if state restore bugs continue in 0.6!

(I opted against trying to write more sophisticated cleanup/recovery code at this time as it seems more likely to cause more problems than it fixes until we know the exact form new persistence issues take.)

artkrz · 2018-01-04T12:49:52Z

Similar problems on:

$ nomad  -version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Here's a little script that helps restore the node quickly.

Please think twice before you run this on your box! Especially on production nodes.

#!/bin/bash

mount_points=`mount | grep /var/nomad/alloc | awk {'print $3'}`

echo "Stopping Nomad..."
systemctl stop nomad

echo "Kill processes using /var/nomad/alloc/..."
for pid in `lsof | grep '/var/nomad/alloc' | awk {'print $2'} | sort | uniq`
do
    kill -TERM $pid
done

nomad_procs=`pgrep -fc /usr/local/bin/nomad`
if [[ "$nomad_procs" == "0" ]]
then
    for mount_point in $mount_points
    do
        echo "Unmounting $mount_point"
        umount $mount_point
    done
fi

echo "Removing /var/nomad/alloc and /var/nomad/client..."
rm -r /var/nomad/alloc
rm -r /var/nomad/client

echo "Starting Nomad"
systemctl start nomad

github-actions · 2022-12-05T02:15:45Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

camerondavison mentioned this issue Jun 28, 2016

failed to decode state: unexpected end of JSON input #1170

Closed

dadgar added type/enhancement theme/client labels Aug 5, 2016

flegler-zz mentioned this issue Aug 15, 2016

unexpected end of JSON input after server crash hashicorp/consul#1221

Closed

dadgar added this to the v0.5.0 milestone Aug 24, 2016

dadgar assigned schmichael Aug 24, 2016

schmichael added a commit that referenced this issue Sep 2, 2016

Lock around saving state

8d9468d

Prevent interleaving state syncs as it could conceivably lead to empty state files as per #1367

schmichael mentioned this issue Sep 2, 2016

Prevent state file corruption and check state file sanity on save #1682

Merged

dadgar removed this from the v0.5.0 milestone Nov 9, 2016

schmichael closed this as completed Jul 3, 2017

schmichael added this to the v0.6.0 milestone Jul 3, 2017

jozef-slezak mentioned this issue Sep 15, 2018

[feature] pluggable storage #4681

Closed

jozef-slezak mentioned this issue Oct 1, 2018

WIP - Badger state DB #4739

Closed

jozef-slezak mentioned this issue Oct 19, 2018

Fix corrupt state after restart (missing key, bucket) #4807

Closed

github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[feature] gracefully handle corrupt startup state #1367

[feature] gracefully handle corrupt startup state #1367

camerondavison commented Jun 28, 2016

ghost commented Jun 28, 2016 •

edited by ghost

Loading

camerondavison commented Jun 29, 2016

flegler-zz commented Aug 15, 2016

capone212 commented Aug 17, 2016

camerondavison commented Aug 17, 2016 •

edited

Loading

ghost commented Aug 17, 2016 •

edited by ghost

Loading

camerondavison commented Aug 17, 2016

flegler-zz commented Aug 17, 2016

flegler-zz commented Aug 18, 2016

flegler-zz commented Aug 19, 2016 •

edited

Loading

dadgar commented Aug 25, 2016

schmichael commented Aug 26, 2016

dadgar commented Nov 9, 2016

yangzhares commented Dec 1, 2016

schmichael commented Dec 1, 2016

pgporada commented Dec 16, 2016 •

edited

Loading

jmvacdmon commented May 2, 2017 •

edited

Loading

schmichael commented Jul 3, 2017

artkrz commented Jan 4, 2018

github-actions bot commented Dec 5, 2022

[feature] gracefully handle corrupt startup state #1367

[feature] gracefully handle corrupt startup state #1367

Comments

camerondavison commented Jun 28, 2016

ghost commented Jun 28, 2016 • edited by ghost Loading

camerondavison commented Jun 29, 2016

flegler-zz commented Aug 15, 2016

capone212 commented Aug 17, 2016

camerondavison commented Aug 17, 2016 • edited Loading

ghost commented Aug 17, 2016 • edited by ghost Loading

camerondavison commented Aug 17, 2016

flegler-zz commented Aug 17, 2016

flegler-zz commented Aug 18, 2016

flegler-zz commented Aug 19, 2016 • edited Loading

dadgar commented Aug 25, 2016

schmichael commented Aug 26, 2016

dadgar commented Nov 9, 2016

yangzhares commented Dec 1, 2016

schmichael commented Dec 1, 2016

pgporada commented Dec 16, 2016 • edited Loading

jmvacdmon commented May 2, 2017 • edited Loading

schmichael commented Jul 3, 2017

artkrz commented Jan 4, 2018

github-actions bot commented Dec 5, 2022

ghost commented Jun 28, 2016 •

edited by ghost

Loading

camerondavison commented Aug 17, 2016 •

edited

Loading

ghost commented Aug 17, 2016 •

edited by ghost

Loading

flegler-zz commented Aug 19, 2016 •

edited

Loading

pgporada commented Dec 16, 2016 •

edited

Loading

jmvacdmon commented May 2, 2017 •

edited

Loading