
Nomad fails to restart on windows nodes with "funlock error: The segment is already unlocked" error #10086

Open
gulavanir opened this issue Feb 24, 2021 · 10 comments
Labels
hcc/cst, stage/accepted, theme/dependencies, theme/platform-windows, type/bug

Comments

@gulavanir commented Feb 24, 2021

Nomad version

1.0.1

Operating system and Environment details

Windows

Issue

Nomad is unable to restart, failing with the following error:

2021-02-24T17:09:51.874Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\ProgramData\nomad\plugins
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2021-02-24T17:09:51.878Z [INFO] client: using state directory: state_dir=C:\ProgramData\nomad\client
2021/02/24 17:09:56.896817 bolt.Close(): funlock error: The segment is already unlocked.

Reproduction steps

There isn't a specific scenario that leads to this state, but we have seen it happen on multiple long-running clusters, specifically on the Windows nodes. The Nomad service on the Windows nodes goes into a loop, repeatedly trying to restart and failing with the same error.

Removing the client state DB file and restarting Nomad resolves the issue, but this requires manual intervention and the issue can reoccur.

@shoenig (Member) commented Feb 24, 2021

Hi @gulavanir, thanks for reporting.

Looks like this is probably a very old bug in the version of boltdb we use (v1.3.1, from June 2017). When the Nomad agent stops, it calls Close on the store and sometimes runs into the bug described in etcd-io/bbolt#60.

I'm not sure what the implications of migrating to the etcd-io fork of boltdb are; we'll need to go through and investigate any other possible issues before making a switch (or cherry-picking the one fix).
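
For context on what a switch might look like on the calling side (just a sketch, not a plan of record): the etcd-io fork keeps the same Open/Options/Close API, so code like the following compiles against either library, with only the import path changing. The state file path below is just an example matching the state_dir from the log above.

package main

import (
	"log"
	"time"

	// etcd-io fork of boltdb; Open/Options/Close match the boltdb v1.3.x API.
	bolt "go.etcd.io/bbolt"
)

func main() {
	// Example path only; adjust to the client's actual data_dir.
	db, err := bolt.Open(`C:\ProgramData\nomad\client\state.db`, 0600,
		&bolt.Options{Timeout: 5 * time.Second})
	if err != nil {
		log.Fatalf("open: %v", err)
	}
	// Close releases the file lock taken by Open.
	if err := db.Close(); err != nil {
		log.Printf("close: %v", err)
	}
}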

@shoenig shoenig added the stage/accepted, theme/dependencies, and type/bug labels on Feb 24, 2021
@shoenig shoenig added this to Needs Triage in Nomad - Community Issues Triage via automation Feb 24, 2021
@shoenig shoenig moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Feb 24, 2021
@idrennanvmware (Contributor) commented:
Thanks for looking into this, @shoenig!!

For some context, about 75% of the nodes in a cluster (for us) are Windows nodes, so we expect to see this more and more as we roll things out. If there's any information you'd like us to gather that would help, please let us know what would be most useful.

Thanks!

@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@raunak2004 commented:
We ran into the same issue on one of our clusters and managed to run 'nomad operator debug -duration=5m -interval=30s -server-id=all -node-id=all'. Attaching the debug archive here:

nomad-debug-2021-03-09-014419Z.tar.gz

Thanks,
Raunak

@tgross (Member) commented Mar 12, 2021

Hey @idrennanvmware, which version of Windows are you deploying on? That might help us narrow down the behavior.

@gulavanir (Author) commented:
Hey @tgross, we are using "Windows Server Core 2019 LTS".

@notnoop (Contributor) commented Jun 2, 2021

@raunak2004 @gulavanir Sorry for the long delay. I'm investigating this issue further now. The following would be immensely helpful for debugging the next time it happens:

  • The full logs of the agent after the failed restart. Based on the sample, I would have expected to see ==> Starting Nomad agent... and ==> Error starting agent lines.

    • If those lines are somehow dropped in the log collection pipeline, I'd suggest starting the Nomad agent manually on the host, e.g. nomad agent -config c:\....\config.hcl
    • Sadly, nomad debug will not capture these initial logs - the debug command is useful for collecting logs from a running agent and isn't so useful for errors that occur very early in the agent lifecycle or ones that cause it to crash.
  • The "corrupted" state.db file from the client data-dir, if it doesn't contain super sensitive info. You can send it to nomad-oss-debug@hashicorp.com or our Enterprise support.

Thank you so much for your patience.

@schmichael (Member) commented:
As per Seth's original comment, it does seem like this has been fixed upstream; however, both Nomad and Consul (as of 1.1.0) are still on the unmaintained original boltdb package. We've found no evidence that switching would cause issues, but we also haven't been able to test extensively.

@notnoop (Contributor) commented Oct 18, 2021

Want to clarify the etcd-io/bbolt issue, as I don't think it applies here. The issue etcd-io/bbolt#60, fixed in etcd-io/bbolt#64, was a regression in etcd-io/bbolt introduced in https://github.com/etcd-io/bbolt/pull/35/files#diff-1efe626cba25e2ad261690300496ba6e83057a4d24a338d5559e6f01f2fb770bR62-R72 . One of the dangers of an actively developed project! The original boltdb library uses db.lockfile.Fd() properly for locking in https://github.com/boltdb/bolt/blob/v1.3.1/bolt_windows.go#L76-L90 .

Also, I suspect the failure might be a red herring or a secondary factor. The funlock error: The segment is already unlocked log line is generated on Raft DB shutdown when agent startup has already been aborted. The triggering cause of the agent shutdown is typically included in the "Error starting agent" log line.

By starting the Nomad agent manually, I hope all Nomad output will be visible and will pinpoint the problem/solution.

@notnoop (Contributor) commented Oct 20, 2021

I dug into this further, and I think I have a working theory. I believe there are at least three contributing factors, and I was able to reproduce some of the symptoms, though not the full problem.

Theory

My current theory is that this is a metastable failure: Nomad crashed for unexpected reasons (e.g. memory overload, segmentation fault, etc.) without releasing some OS file locks, and then fails to restart until the OS file locks are released or the files are deleted.

The funlock log error is a symptom of the problem, not the trigger condition.

Details

The easy one is the mystery of the missing log line indicating why the agent fails to start. Until #11353, we didn't capture that error in the log files/syslog. That explains why your log captures are missing this line.

The second issue is Windows file-locking behavior: boltdb, our storage engine, uses the LockFileEx API for locking files. However, if we don't cleanly release the lock before termination, the file lock may still be held, which prevents Nomad from restarting. I can replicate some failure cases where the Nomad process doesn't invoke UnlockFileEx on failure. The documentation states:

If a process terminates with a portion of a file locked or closes a file that has outstanding locks, the locks are unlocked by the operating system. However, the time it takes for the operating system to unlock these locks depends upon available system resources. Therefore, it is recommended that your process explicitly unlock all files it has locked when it terminates. If this is not done, access to these files may be denied if the operating system has not yet unlocked them.

If the Nomad process restarts and tries to grab the lock, it may fail to do so until the OS unlocks the file. I suspect deleting the lock file manually would help.

Thirdly, when boltdb fails to get the lock, it invokes funlock/UnlockFileEx in https://github.com/boltdb/bolt/blob/v1.3.1/db.go#L178-L189 and https://github.com/boltdb/bolt/blob/v1.3.1/db.go#L425-L428 . However, the funlock will fail, because the process never grabbed the lock in the first place, and this generates the spurious error.
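
To illustrate that last point, here is a minimal standalone sketch (not boltdb's or Nomad's actual code; it assumes golang.org/x/sys/windows and a hypothetical lock-demo.db file) showing that calling UnlockFileEx on a file region that was never locked produces exactly the "The segment is already unlocked." error:

//go:build windows

package main

import (
	"log"
	"os"

	"golang.org/x/sys/windows"
)

func main() {
	f, err := os.OpenFile("lock-demo.db", os.O_CREATE|os.O_RDWR, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Deliberately skip LockFileEx, mimicking the case where bolt.Open timed
	// out before ever acquiring the lock, then try to unlock anyway.
	ol := new(windows.Overlapped)
	err = windows.UnlockFileEx(windows.Handle(f.Fd()), 0, 1, 0, ol)
	log.Printf("UnlockFileEx without a prior lock: %v", err)
	// Prints: The segment is already unlocked.
}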

Ways to validate

The following info can validate whether the theory is likely:

  • On failure, deleting the lock file (but not the rest of the data dir) and then restarting the agent. If the agent starts up, it's probably a locking issue.
  • Restarting the agent with the logging fix from "Log the cause behind agent startup failure" #11353 applied - it will highlight the cause of the startup failure much better.
  • Checking how long each restart takes. Currently, the flock call has a 5-second timeout, and I suspect that's how long the agent takes before it dies, though it may be longer due to other launch processes. A quick way to check this timing is sketched below.
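
For reference, a minimal timing check along the lines of the last bullet (a standalone sketch, not Nomad code; the state path is just an example of the default Windows client location, and it should only be run while the agent is stopped or failing to start):

package main

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	// Example path only; adjust to your client's data_dir.
	path := `C:\ProgramData\nomad\client\state.db`

	start := time.Now()
	// Use the same 5-second lock timeout mentioned above. If another process
	// (or a stale OS lock) holds the file, Open should fail after ~5 seconds.
	db, err := bolt.Open(path, 0600, &bolt.Options{Timeout: 5 * time.Second})
	log.Printf("bolt.Open took %s err=%v", time.Since(start), err)
	if err == nil {
		db.Close()
	}
}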

We still need to identify the cause of the initial Nomad crash - but we must also ensure that file locks are always freed so Nomad can recover cleanly.

I will provide more documentation/details tomorrow hopefully with reproduction steps.

@notnoop (Contributor) commented Oct 21, 2021

Following up to report my reproduction. I have confirmed that the funlock error gets generated when Nomad fails to lock the file.

On a Windows machine, I started two agents in separate terminals. First, I started a server/client agent with the following command:

nomad agent -server -client -bootstrap-expect=1 -data-dir $env:temp\data

Then, in a second PowerShell terminal, I started a client using the same data-dir, which should fail because the first agent holds the lock on the data-dir:

nomad agent -client -data-dir $env:temp\data
==> No configuration files loaded
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to initialize client: failed to open state database: timed out while opening database, is another Nomad process accessing data_dir C:\Users\Mahmood\AppData\Local\Temp\data\client?
    2021-10-21T14:27:02.295-0400 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=C:\Users\Mahmood\AppData\Local\Temp\data\plugins
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.308-0400 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2021-10-21T14:27:02.309-0400 [INFO]  client: using state directory: state_dir=C:\Users\Mahmood\AppData\Local\Temp\data\client
    2021/10/21 14:27:07.319398 bolt.Close(): funlock error: The segment is already unlocked.

Note that the actual error above is "failed to open state database: timed out while opening database, is another Nomad process accessing data_dir C:\Users\Mahmood\AppData\Local\Temp\data\client?", but it gets skipped from the log files because of the #11353 issue.

Additionally, I've tried to trigger the crashed-without-freeing-the-lock issue independently, without luck. I ran the following program in a loop, having it simulate crashes, but the flock call never failed:

go build .
ForEach($n in 1..100000) { .\winflock.exe -iter $n 2>&1 | tee -a output}
main.go

package main

import (
	"flag"
	"fmt"
	"log"
	"math/rand"
	"os"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	path := flag.String("path", "./db", "path to database")
	iter := flag.Int("iter", -1, "iteration count")
	wait := flag.Duration("wait", 0, "time before shutting down")
	flag.Parse()

	// If no iteration count was passed, pick a random one so the
	// simulated-crash branch below is hit on roughly 1 in 8 runs.
	if *iter == -1 {
		rand.Seed(time.Now().Unix())
		*iter = rand.Intn(32)
	}
	log.SetPrefix(fmt.Sprintf("iter=%03d ", *iter))

	log.Printf("starting path=%s wait=%s", *path, *wait)
	// Same 5-second lock timeout that shows up in the Nomad logs.
	db, err := bolt.Open(*path, 0600, &bolt.Options{Timeout: 5 * time.Second})
	log.Printf("bolt.Open completed err=%v", err)
	if err != nil {
		return
	}

	// Simulate an unexpected crash without cleanup.
	if *iter%8 == 0 {
		log.Printf("CRASHING")
		os.Exit(2)
	}

	if *wait > 0 {
		time.Sleep(*wait)
	}

	db.Close()
	log.Printf("stopping")
}

When this issue happens again, can you also inspect the running processes and check whether a rogue Nomad agent still holds the file lock on the data dir?
