
Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118] #218

Open
bashoshaun opened this issue Dec 12, 2016 · 1 comment

@bashoshaun

A customer occasionally runs into a problem where their node will not start back up after it's been shutdown to perform a backup.

On startup, they see this kind of output in console.log:

2016-12-11 01:14:47.674 [error] <0.1611.0>@riak_kv_vnode:init:512 Failed to start riak_kv_eleveldb_backend backend for index 1381575766539369164864420818427111292018498732032 error: {db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}
2016-12-11 01:14:47.678 [notice] <0.1611.0>@riak:stop:43 "backend module failed to start."
2016-12-11 01:14:47.678 [error] <0.1611.0> gen_fsm <0.1611.0> in state started terminated with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184
2016-12-11 01:14:47.678 [error] <0.1611.0> CRASH REPORT Process <0.1611.0> with 1 neighbours exited with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in gen_fsm:terminate/7 line 600
2016-12-11 01:14:47.679 [error] <0.163.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1611.0> exit with reason no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in context child_terminated
2016-12-11 01:14:47.679 [error] <0.160.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1585.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1184}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,908}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_fsm,sync_send_event,[<0.1611.0>,wait_for_init,...]}} in context child_terminated
2016-12-11 01:14:47.683 [info] <0.500.0>@riak_kv_app:prep_stop:232 Stopping application riak_kv - marked service down.

I've attached the full console.log file for context: console.log.txt

The customer has also implemented some basic debugging in their startup script, as it was initially thought a process might have been holding a lock on these files. This is an example of what the debugging output looked like while the node was up and running normally:

lsof:

...
beam.smp  177432           riak  mem       REG              253,8      32526336 1744834583 /data/riak/leveldb/1267395951122892374379757940871151681107879002112/000369.log
beam.smp  177432           riak  mem       REG              253,8      27365376 1879049262 /data/riak/leveldb/68507889249886074290797726533575766546371837952/000366.log
beam.smp  177432           riak  mem       REG              253,8      27631616  805307404 /data/riak/leveldb/125597796958124469533129165311555572001681702912/000367.log
beam.smp  177432           riak  mem       REG              253,8      32116736 1476398089 /data/riak/leveldb/1438665674247607560106752257205091097473808596992/000382.log
beam.smp  177432           riak  mem       REG              253,8      32694272 1476398103 /data/riak/leveldb/1039036320289938793410432185759232459286639542272/000382.log
...

After the stop, there are no leveldb entries in the output of lsof. Similarly, there are no beam processes running.
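One caveat with the lsof check: lsof lists open file descriptors, not advisory locks, and the standard LevelDB LOCK scheme uses fcntl-based advisory locks. A more direct post-stop probe, sketched below in Python under the assumption that eleveldb uses that standard scheme (the /data/riak/leveldb layout is taken from the logs above), would be to try taking and immediately releasing each lock:

import fcntl
import glob
import os
import sys

# Try to take (and immediately release) the advisory lock on each vnode's
# LOCK file; failure means some process still holds it, and /proc/locks
# will name the owning PID.
def lock_is_free(path):
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except (BlockingIOError, PermissionError):
        return False
    finally:
        os.close(fd)

for lock in glob.glob("/data/riak/leveldb/*/LOCK"):
    if not lock_is_free(lock):
        print("still locked:", lock, file=sys.stderr)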

So when Riak restarts, there's no obvious reason why there should be a failure. When the customer then manually attempts a restart after realising the node is down, it comes up fine.

The customer is running RHEL 7.
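For what it's worth, in stock LevelDB the exact wording "already held by process" appears to come from the library's in-process table of locked files, not from a failed OS-level fcntl call (which would surface an errno string such as "Resource temporarily unavailable" instead). If Basho's eleveldb behaves the same way, that would point at the restarting beam.smp process itself, e.g. a stale entry in that table or two vnodes opening the same directory, rather than an external lock holder. A minimal sketch of the assumed two-layer scheme:

import fcntl
import os

# fcntl locks are per-process, so LevelDB-style code also keeps a
# process-wide set of locked paths to catch a second open of the same DB
# inside one OS process; that set is where "already held by process"
# would come from.
_locked_paths = set()

def lock_db(lock_path):
    if lock_path in _locked_paths:
        raise OSError("lock %s: already held by process" % lock_path)
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        os.close(fd)
        raise OSError("lock %s: %s" % (lock_path, e.strerror))
    _locked_paths.add(lock_path)
    return fd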

@Basho-JIRA Basho-JIRA changed the title Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118] Dec 12, 2016
@bashoshaun
Author

The customer would like to add the following:

Perhaps it could also mention that while no process has the LOCK files open after riak stop, they do persist in the file system, and also that the file system is XFS on a thinly provisioned LVM logical volume, which is used to create a snapshot during the node stop.
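On the first point: the LOCK file persisting on disk after a clean stop is expected on its own, since LevelDB creates it but never deletes it; only the kernel-held advisory lock has to go away with the process. A small demonstration, using a hypothetical /tmp path:

import fcntl
import os

# The advisory lock is dropped by the kernel when the owning process
# exits, but the LOCK file itself remains on disk.
path = "/tmp/LOCK-demo"

pid = os.fork()
if pid == 0:
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    os._exit(0)                    # lock dies with this child process
os.waitpid(pid, 0)

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # succeeds: lock was freed
print("file still exists:", os.path.exists(path))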
